11g RAC: node 2 host went down (two-node RAC)

--Node 2 database alert log

Mon Jul 01 06:38:22 2019
SUCCESS: diskgroup SAS_ARCH was dismounted
Mon Jul 01 06:38:22 2019
Shutting down instance (abort)
License high water mark = 1923
USER (ospid: 82381): terminating the instance
Mon Jul 01 06:38:22 2019
opiodr aborting process unknown ospid (12589) as a result of ORA-1092
Mon Jul 01 06:38:22 2019
opiodr aborting process unknown ospid (45276) as a result of ORA-1092
Mon Jul 01 06:38:22 2019
opiodr aborting process unknown ospid (107399) as a result of ORA-1092
Instance terminated by USER, pid = 82381
Mon Jul 01 06:38:24 2019
Instance shutdown complete

--Host system log

Jul 1 06:35:01 test2 auditd[16253]: Audit daemon rotating log files
Jul 1 06:38:19 test2 init: oracle-ohasd main process (15639) killed by TERM signal
Jul 1 06:38:19 test2 init: oracle-tfa main process (15638) killed by TERM signal
Jul 1 06:38:19 test2 init: tty (/dev/tty2) main process (16997) killed by TERM signal
Jul 1 06:38:19 test2 init: tty (/dev/tty3) main process (16999) killed by TERM signal
Jul 1 06:38:19 test2 init: tty (/dev/tty4) main process (17004) killed by TERM signal
Jul 1 06:38:19 test2 init: tty (/dev/tty5) main process (17006) killed by TERM signal
Jul 1 06:38:19 test2 init: tty (/dev/tty6) main process (17008) killed by TERM signal
Jul 1 06:38:19 test2 gnome-session[17110]: WARNING: Failed to send buffer
Jul 1 06:38:19 test2 gnome-session[17110]: WARNING: Failed to send buffer
Jul 1 06:38:23 test2 ntpd[90741]: Deleting interface #15 bond0:1, 10.1.11.103#123, interface stats: received=1410, sent=0, dropped=0, active_time=56169415 secs
Jul 1 06:38:39 test2 pulseaudio[17164]: pid.c: Failed to open PID file '/var/lib/gdm/.pulse/45593399e441b14e2757581a00000028-runtime/pid': No such file or directory
Jul 1 06:38:39 test2 pulseaudio[17164]: pid.c: Failed to open PID file '/var/lib/gdm/.pulse/45593399e441b14e2757581a00000028-runtime/pid': No such file or directory
Jul 1 06:38:46 test2 ntpd[90741]: Deleting interface #14 bond1:1, 169.254.7.117#123, interface stats: received=0, sent=0, dropped=0, active_time=56169467 secs
Jul 1 06:38:51 test2 abrtd: Got signal 15, exiting
Jul 1 06:38:51 test2 xinetd[45495]: Exiting...
Jul 1 06:38:51 test2 acpid: exiting
Jul 1 06:38:51 test2 ntpd[90741]: ntpd exiting on signal 15
Jul 1 06:38:53 test2 init: Disconnected from system bus
Jul 1 06:38:53 test2 rtkit-daemon[17166]: Demoting known real-time threads.
Jul 1 06:38:53 test2 rtkit-daemon[17166]: Demoted 0 threads.
Jul 1 06:38:53 test2 auditd[16253]: The audit daemon is exiting.
Jul 1 06:38:53 test2 kernel: type=1305 audit(1561934333.370:37053744): audit_pid=0 old=16253 auid=4294967295 ses=4294967295 res=1
Jul 1 06:38:53 test2 kernel: type=1305 audit(1561934333.475:37053745): audit_enabled=0 old=1 auid=4294967295 ses=4294967295 res=1
Jul 1 06:38:53 test2 kernel: Kernel logging (proc) stopped.
Jul 1 06:38:53 test2 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="16275" x-info="http://www.rsyslog.com"] exiting on signal 15.

---Node 2 Grid log: alertbapdb2.log under /u01/11.2.0/grid/log/test2
2019-07-01 06:34:49.606:
[client(75150)]CRS-0009:log file "/u01/11.2.0/grid/log/test2/client/olsnodes.log" reopened
2019-07-01 06:34:49.606:
[client(75150)]CRS-0019:file rotation terminated. log file: "/u01/11.2.0/grid/log/test2/client/olsnodes.log"
2019-07-01 06:38:33.151:
[/u01/11.2.0/grid/bin/orarootagent.bin(106660)]CRS-5822:Agent '/u01/11.2.0/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:5:52057} in /u01/11.2.0/grid/log/test2/agent/crsd/orarootagent_root//orarootagent_root.log.
LFI-01523: rename() failed.

2019-07-01 06:38:33.887:
[ctssd(104917)]CRS-2405:The Cluster Time Synchronization Service on host test2 is shutdown by user
2019-07-01 06:38:33.892:
[mdnsd(103640)]CRS-5602:mDNS service stopping by request.
2019-07-01 06:38:45.860:
[cssd(103758)]CRS-1603:CSSD on node test2 shutdown by user.
2019-07-01 06:38:45.970:
[ohasd(103446)]CRS-2767:Resource state recovery not attempted for 'ora.cssdmonitor' as its target state is OFFLINE
2019-07-01 06:38:46.064:
[cssd(103758)]CRS-1660:The CSS daemon shutdown has completed
2019-07-01 06:38:49.592:
[gpnpd(103651)]CRS-2329:GPNPD on node test2 shutdown.
2019-07-01 09:28:04.022:
[ohasd(17090)]CRS-2112:The OLR service started on node test2.
2019-07-01 09:28:04.069:
[ohasd(17090)]CRS-1301:Oracle High Availability Service started on node test2.
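In the Grid alert log each message sits on the line after its timestamp, which makes it awkward to grep or to line up against the database and host logs. A minimal sketch that joins each timestamp with its message; crs_timeline is a hypothetical helper name, and the two-line layout is assumed from the excerpt above:

```shell
# Join each "YYYY-MM-DD HH:MM:SS.mmm:" timestamp line with the message line
# that follows it. Lines that do not directly follow a timestamp
# (for example "LFI-01523: rename() failed.") are skipped.
crs_timeline() {
  awk '
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9:.]+:$/ {
      ts = $0
      sub(/:$/, "", ts)        # drop the trailing colon of the timestamp
      next
    }
    ts != "" && NF {           # first non-empty line after a timestamp
      print ts " " $0
      ts = ""
    }
  ' "$@"
}

# Example: show only the shutdown-related messages seen in this incident
# crs_timeline alertbapdb2.log | grep -E 'CRS-(2405|1603|1660|2329)'
```

The function reads the files given as arguments, or standard input when none are given.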

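RAC nodes rely on several prerequisites to communicate: synchronized time, the disk heartbeat, and the network (interconnect) heartbeat. None of them can be missing.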
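Each of these three prerequisites can be checked from the command line. A minimal sketch, assuming it is run on a cluster node by a user allowed to call the Grid utilities; check_rac_prereqs is a hypothetical helper name, and GRID_HOME defaults to the Grid home that appears in the log paths above:

```shell
# Check time sync, voting disks and interconnect on the local node.
# GRID_HOME is an assumption taken from the log paths; override it as needed.
check_rac_prereqs() {
  local bin="${GRID_HOME:-/u01/11.2.0/grid}/bin"
  if [ ! -x "$bin/crsctl" ]; then
    echo "crsctl not found under $bin" >&2
    return 1
  fi
  echo "== Time: Cluster Time Synchronization Service =="
  "$bin/crsctl" check ctss
  echo "== Disk heartbeat: voting disks =="
  "$bin/crsctl" query css votedisk
  echo "== Network heartbeat: interconnect interfaces =="
  "$bin/oifcfg" getif
  echo "== Overall Clusterware stack =="
  "$bin/crsctl" check crs
}
```

Calling check_rac_prereqs on each node and comparing the output is usually enough to see which of the three is unhealthy.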

---Node 1 database alert log

Mon Jul 01 06:38:24 2019
Reconfiguration started (old inc 16, new inc 18)
List of instances:
1 (myinst: 1)
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Jul 01 06:38:25 2019
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Mon Jul 01 06:38:25 2019
LMS 3: 2 GCS shadows cancelled, 1 closed, 0 Xw survived
Mon Jul 01 06:38:25 2019
Mon Jul 01 06:38:25 2019
LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Mon Jul 01 06:38:36 2019
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Mon Jul 01 06:38:36 2019
Instance recovery: looking for dead threads
Beginning instance recovery of 1 threads
Mon Jul 01 06:38:52 2019
parallel recovery started with 32 processes
Started redo scan
Completed redo scan
read 12123 KB redo, 6138 data blocks need recovery
Mon Jul 01 06:38:55 2019
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Mon Jul 01 06:39:07 2019
Reconfiguration complete
Mon Jul 01 06:39:32 2019
Started redo application at
Thread 2: logseq 218275, block 1708335

---Cause (initial analysis):
2019-07-01 06:38:33.887:
[ctssd(104917)]CRS-2405:The Cluster Time Synchronization Service on host test2 is shutdown by user

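The Cluster Time Synchronization Service on host test2 was shut down by the user.

The BIOS (hardware clock) time on the hosts is inconsistent with the system time: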

[oracle@test1 ~]$ su - root
Password:
[root@test1 ~]# hwclock
Mon 01 Jul 2019 11:27:27 AM CST -0.485777 seconds
[root@test1 ~]# date
Mon Jul 1 10:44:03 CST 2019

[root@test2 ~]# hwclock
Mon 01 Jul 2019 10:42:33 AM CST -0.219479 seconds
[root@test2 ~]# date
Mon Jul 1 10:42:36 CST 2019
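The gap between the hardware clock and the system clock can be put into numbers. A minimal sketch, assuming GNU date (for the -d option) and timestamps retyped by hand from the hwclock and date output above into YYYY-MM-DD HH:MM:SS form; clock_drift_seconds is a hypothetical helper name:

```shell
# Print the signed difference in seconds: first timestamp minus second.
# Both arguments use the form "YYYY-MM-DD HH:MM:SS" and are read as UTC,
# which is fine here because only the difference matters.
clock_drift_seconds() {
  local hw sys
  hw=$(date -u -d "$1" +%s) || return 1
  sys=$(date -u -d "$2" +%s) || return 1
  echo $(( hw - sys ))
}

# test1: hwclock showed 11:27:27 while date showed 10:44:03
clock_drift_seconds "2019-07-01 11:27:27" "2019-07-01 10:44:03"
# test2: hwclock showed 10:42:33 while date showed 10:42:36
clock_drift_seconds "2019-07-01 10:42:33" "2019-07-01 10:42:36"
```

With these inputs the first call prints 2604, so the hardware clock on test1 is about 43 minutes ahead of the system time, and the second prints -3, which is within the few seconds it takes to type the two commands.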

--Time synchronization configuration

--Node 1: cat /etc/ntp.conf

server pbsntp01.sx.com iburst
server pbsntp02.sx.com iburst

--Node 2, after modification: cat /etc/ntp.conf
server 10.0.10.2 iburst
#server pbsntp02.sx.com iburst

--Solution:

hwclock -w

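If it is not convenient to run this by hand at a suitable time, the change can be made through a scheduled task using the script below.

--As the root user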
vi hwclock.sh

#!/bin/bash
# Function: refresh the BIOS (hardware clock) time from the system time
exec >> "/var/log/hwclock$(date +%y%m%d%H).log" 2>&1
echo "System date:"
date
sleep 3
echo "Hardware clock before the update:"
hwclock
sleep 3
echo "Writing the system time to the BIOS (hwclock -w)"
hwclock -w
sleep 3
# check again
echo "Hardware clock after the update:"
hwclock
sleep 3
date

[oracle@test1 ~]$ ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*1.0.10.250      19.19.24.22      3 u   19  256  377    0.436   -3.817   6.576
[oracle@test1 ~]$ ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*1.0.10.250      19.19.24.22      3 u   32  256  377    0.436   -3.817   6.576

[oracle@test2 ~]$ ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*1.0.10.250      12.2.15.2        3 u   61  256  377    0.386    2.638   7.938
[oracle@test2 ~]$ ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*1.0.10.250      12.2.15.2        3 u   62  256  377    0.386    2.638   7.938

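remote: the name or address of the NTP server that answered the request.
refid: the upstream time source that the remote NTP server itself synchronizes to.
st: the stratum of the remote server. NTP is hierarchical: top-level servers, then layers of relay servers, then clients. Stratum values run from 1 (highest) to 16, where 16 means unsynchronized. To limit load and network congestion, clients should as a rule avoid connecting directly to stratum 1 servers.
when: the number of seconds since the last successful response from the server.
poll: the interval, in seconds, at which the local host polls the remote server. When NTP first starts the poll value is small, so the host syncs more often and converges on the correct time quickly; the value then grows gradually and syncs become less frequent.
reach: an octal value that shows whether the server is reachable. It is an 8-bit shift register recording the results of the last eight polls; 377 means all eight succeeded.
delay: the round trip time of a sync request from the local host to the NTP server, in milliseconds (ms).
offset: the time difference between the host and the time source it synchronizes with over NTP, in milliseconds (ms). The closer offset is to 0, the closer the host's time is to the NTP server's.
jitter: a statistical value describing how the offset is spread over a number of consecutive samples, in milliseconds (ms). Put simply, the smaller its absolute value, the more precise the host's time.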

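----Key check: watch whether offset keeps growing on this host; a value within 100 (ms) means there is no problem.

---A scheduled task can be added to record it: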
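The 100 ms rule can be applied automatically to ntpq -p output. A minimal sketch, assuming the ten-column layout shown above (offset in column 9); check_ntp_offset is a hypothetical helper name, and the limit defaults to the 100 ms threshold mentioned in these notes:

```shell
# Read `ntpq -p` output on stdin and flag peers whose absolute offset
# exceeds the limit in ms (first argument, default 100).
# Exit status is 1 if any peer is over the limit, otherwise 0.
check_ntp_offset() {
  awk -v limit="${1:-100}" '
    $1 == "remote" || /^=+$/ { next }   # skip the header and the ===== rule
    NF >= 10 {
      off = $9 + 0
      if (off < 0) off = -off           # compare the absolute value
      status = (off > limit + 0) ? "WARN" : "OK"
      print status, $1, "offset_ms=" $9
      if (status == "WARN") bad = 1
    }
    END { exit (bad ? 1 : 0) }
  '
}

# Example: ntpq -p | check_ntp_offset 100
```

A single reading only shows the current offset; to see whether it keeps growing, compare the readings collected over several hours.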

#ntpd check, once an hour (a * in the minute field would run it every minute)
0 * * * * /bin/bash /home/oracle/shell/ntpq.sh > /dev/null 2>&1

vi ntpq.sh

#!/bin/bash
# Append two ntpq samples, 3 seconds apart, to an hourly log file
source /home/oracle/.bash_profile
exec >> "/home/oracle/shell/ntpq_$(date +%y%m%d%H).log" 2>&1
ntpq -p
sleep 3
ntpq -p

chmod +x ntpq.sh

---Actual root cause of this incident: a power supply module failure at the OS (host) level.