2014-11-27

has a disk HB, but no network HB, DHB has rcfg

11gR2 GI Node May not Join the Cluster After Private Network is Functional After Eviction due to Private Network Problem (Doc ID 1479380.1)


11gR2 Grid Infrastructure, cssd fails to join the cluster (GI fails to start as a result) after recovered from private network failure caused by pulling cluster interconnect cables etc
  • OS /var/log/messages - node1
Jul 16 16:12:54 <0.6> racnode1 kernel: e1000e: pci1p2 NIC Link is Down
Jul 16 16:12:55 <0.6> racnode1 kernel: bonding: bond0: link status definitely down for interface pci1p2, disabling it
Jul 16 16:12:55 <0.6> racnode1 kernel: bonding: bond0: now running without any active interface !                                    ##>> private network failed
..
Jul 16 16:15:00 <0.6> racnode1 kernel: bnx2 0000:02:00.1: em2: NIC Copper Link is Up, 1000 Mbps full duplex         ##>> private network recovered
Jul 16 16:15:01 <0.6> racnode1 kernel: bonding: bond0: link status definitely up for interface em2.
Jul 16 16:15:01 <0.6> racnode1 kernel: bonding: bond0: making interface em2 the new active one.
Jul 16 16:15:01 <0.6> racnode1 kernel: bonding: bond0: first active interface up!
Jul 16 16:15:15 <0.6> racnode1 kernel: e1000e: pci1p2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Jul 16 16:15:16 <0.6> racnode1 kernel: bonding: bond0: link status definitely up for interface pci1p2.
..
   
  • <GI_HOME>/log/<node>/alert<node>.log - node1
2012-07-16 16:13:09.846
[cssd(11589)]CRS-1612:Network communication with node racnode2 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.190 seconds
2012-07-16 16:13:16.860
[cssd(11589)]CRS-1611:Network communication with node racnode2 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 7.180 seconds
2012-07-16 16:13:21.870
[cssd(11589)]CRS-1610:Network communication with node racnode2 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.170 seconds
2012-07-16 16:13:24.038
[cssd(11589)]CRS-1607:Node racnode2 is being evicted in cluster incarnation 237506449; details at (:CSSNM00007:) in /ocw/grid/log/racnode1/cssd/ocssd.log.
2012-07-16 16:13:25.020
[ohasd(11154)]CRS-8011:reboot advisory message from host: racnode2, component: cssmonit, with time stamp: L-2012-07-16-16:13:25.020
[ohasd(11154)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2012-07-16 16:13:25.020
[ohasd(11154)]CRS-8011:reboot advisory message from host: racnode2, component: cssagent, with time stamp: L-2012-07-16-16:13:25.020
[ohasd(11154)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2012-07-16 16:13:55.048
[cssd(11589)]CRS-1601:CSSD Reconfiguration complete. Active nodes are racnode1 .
2012-07-16 16:13:55.056
[crsd(13213)]CRS-5504:Node down event reported for node 'racnode2'.
2012-07-16 16:13:57.538
[crsd(13213)]CRS-2773:Server 'racnode2' has been removed from pool 'Generic'.
..

  • <GI_HOME>/log/<node>/alert<node>.log - node2
2012-07-16 16:13:10.052
[cssd(22171)]CRS-1612:Network communication with node racnode1 (1) missing for 50% of timeout interval.  Removal of this node from cluster in 14.180 seconds
2012-07-16 16:13:17.066
[cssd(22171)]CRS-1611:Network communication with node racnode1 (1) missing for 75% of timeout interval.  Removal of this node from cluster in 7.170 seconds
2012-07-16 16:13:22.076
[cssd(22171)]CRS-1610:Network communication with node racnode1 (1) missing for 90% of timeout interval.  Removal of this node from cluster in 2.160 seconds
2012-07-16 16:13:24.243
[cssd(22171)]CRS-1608:This node was evicted by node 1, racnode1; details at (:CSSNM00005:) in /ocw/grid/log/racnode2/cssd/ocssd.log.
2012-07-16 16:13:24.243
[cssd(22171)]CRS-1608:This node was evicted by node 1, racnode1; details at (:CSSNM00005:) in /ocw/grid/log/racnode2/cssd/ocssd.log.
2012-07-16 16:13:24.243
[cssd(22171)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /ocw/grid/log/racnode2/cssd/ocssd.log
2012-07-16 16:13:24.243
[cssd(22171)]CRS-1652:Starting clean up of CRSD resources.
2012-07-16 16:13:24.243
[cssd(22171)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /ocw/grid/log/racnode2/cssd/ocssd.log.
..
2012-07-16 16:13:24.962
[ohasd(17683)]CRS-2765:Resource 'ora.evmd' has failed on server 'racnode2'.
2012-07-16 16:13:24.969
[ohasd(17683)]CRS-2765:Resource 'ora.ctssd' has failed on server 'racnode2'.           ##>> node being evicted after private network down
2012-07-16 16:19:33.962
[ohasd(17652)]CRS-2112:The OLR service started on node racnode2.                        ##>> GI auto starting upon node restart
2012-07-16 16:19:34.069
[ohasd(17652)]CRS-1301:Oracle High Availability Service started on node racnode2.
2012-07-16 16:19:34.146
[ohasd(17652)]CRS-8011:reboot advisory message from host: racnode2, component: cssmonit, with time stamp: L-2012-07-16-16:13:25.020
[ohasd(17652)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2012-07-16 16:19:34.194
[ohasd(17652)]CRS-8011:reboot advisory message from host: racnode2, component: cssagent, with time stamp: L-2012-07-16-16:13:25.020
[ohasd(17652)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2012-07-16 16:19:34.196
[ohasd(17652)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 2 were announced and 0 errors occurred
2012-07-16 16:19:36.442
[/ocw/grid/bin/oraagent.bin(21497)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/ocw/grid/log/racnode2/agent/ohasd/oraagent_oracle/oraagent_oracle.log"
2012-07-16 16:19:41.385
[ohasd(17652)]CRS-2302:Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).
2012-07-16 16:19:41.406
[gpnpd(21770)]CRS-2328:GPNPD started on node racnode2.
2012-07-16 16:19:43.738
[cssd(22106)]CRS-1713:CSSD daemon is started in clustered mode
2012-07-16 16:19:45.545
[ohasd(17652)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2012-07-16 16:20:04.732
[cssd(22106)]CRS-1707:Lease acquisition for node racnode2 number 2 completed
..
2012-07-16 16:20:06.041
[cssd(22106)]CRS-1605:CSSD voting file is online: /dev/iscsi/iqn.2006-01.com.openfiler:blnhp1102-vm31.asm-crs.disk1.fg1.box3; details in /ocw/grid/log/racnode2/cssd/ocssd.log.
2012-07-16 16:20:06.078
[cssd(22106)]CRS-1605:CSSD voting file is online: /dev/mpath/360060160d840280010c7b6217ecbe111; details in /ocw/grid/log/racnode2/cssd/ocssd.log.
2012-07-16 16:20:06.132
[cssd(22106)]CRS-1605:CSSD voting file is online: /dev/mpath/360060160ee4028006c1fa21d7ecbe111; details in /ocw/grid/log/racnode2/cssd/ocssd.log.
2012-07-16 16:29:42.783
[/ocw/grid/bin/cssdagent(22084)]CRS-5818:Aborted command 'start' for resource 'ora.cssd'. Details at (:CRSAGF00113:) {0:0:2} in /ocw/grid/log/racnode2/agent/ohasd/oracssdagent_root/oracssdagent_root.log.    ##>> cssd failed after 10 minutes as it can't join the cluster
2012-07-16 16:29:42.783
[cssd(22106)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /ocw/grid/log/racnode2/cssd/ocssd.log
2012-07-16 16:29:42.783
[cssd(22106)]CRS-1603:CSSD on node racnode2 shutdown by user.
2012-07-16 16:29:48.176
[ohasd(17652)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'racnode2'.
2012-07-16 16:29:49.593
[cssd(12197)]CRS-1713:CSSD daemon is started in clustered mode      ##>> cssd auto restarting after first failure
2012-07-16 16:29:51.370
[ohasd(17652)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2012-07-16 16:30:12.581
[cssd(12197)]CRS-1707:Lease acquisition for node racnode2 number 2 completed
2012-07-16 16:30:13.901
[cssd(12197)]CRS-1605:CSSD voting file is online: /dev/iscsi/iqn.2006-01.com.openfiler:blnhp1102-vm31.asm-crs.disk1.fg1.box3; details in /ocw/grid/log/racnode2/cssd/ocssd.log.
2012-07-16 16:30:13.922
[cssd(12197)]CRS-1605:CSSD voting file is online: /dev/mpath/360060160d840280010c7b6217ecbe111; details in /ocw/grid/log/racnode2/cssd/ocssd.log.
2012-07-16 16:30:13.926
[cssd(12197)]CRS-1605:CSSD voting file is online: /dev/mpath/360060160ee4028006c1fa21d7ecbe111; details in /ocw/grid/log/racnode2/cssd/ocssd.log.
  
  • <GI_HOME>/log/<node>/cssd/ocssd.log - node1
2012-07-16 16:12:56.236: [GIPCHGEN][1091303744] gipchaInterfaceFail: marking interface failing 0xc058600 { host 'racnode2', haName 'CSS_racnode1_X45', local 0xbd4bd50, ip '192.168.44.10:21857', subnet '192.168.44.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 0, flags 0x6 }
2012-07-16 16:12:56.238: [GIPCHGEN][1091303744] gipchaInterfaceDisable: disabling interface 0xbd4bd50 { host '', haName 'CSS_racnode1_X45', local (nil), ip '192.168.44.5:35675', subnet '192.168.44.0', mask '255.255.255.0', mac '00-26-55-52-75-32', ifname 'bond0', numRef 0, numFail 2, idxBoot 0, flags 0x19cd }
2012-07-16 16:12:56.238: [GIPCHGEN][1091303744] gipchaInterfaceDisable: disabling interface 0x2aaab4081d60 { host 'racnode2', haName 'CSS_racnode1_X45', local 0xbd4bd50, ip '192.168.44.5:35675', subnet '192.168.44.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 3, flags 0x18c6 }
..
2012-07-16 16:12:56.238: [GIPCHALO][1091303744] gipchaLowerCleanInterfaces: performing cleanup of disabled interface 0xc058600 { host 'racnode2', haName 'CSS_racnode1_X45', local 0xbd4bd50, ip '192.168.44.10:21857', subnet '192.168.44.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 0, flags 0xa6 }
2012-07-16 16:12:56.918: [GIPCHGEN][1091303744] gipchaInterfaceDisable: disabling interface 0x2aaab4081d60 { host 'racnode2', haName 'CSS_racnode1_X45', local 0xbd4bd50, ip '192.168.44.5:35675', subnet '192.168.44.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 3, flags 0x18e6 }
2012-07-16 16:12:56.918: [GIPCHGEN][1091303744] gipchaNodeAddInterface: adding interface information for inf 0xbd54c10 { host 'racnode2', haName 'CSS_racnode1_X45', local (nil), ip '192.168.44.5:35675', subnet '192.168.44.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 0, flags 0x1842 }
2012-07-16 16:12:56.918: [GIPCHTHR][1091303744] gipchaWorkerCreateInterface: created remote bootstrap multicast interface for node 'racnode2', haName 'CSS_racnode1_X45', inf 'mcast://224.0.0.251:42424/192.168.44.5:35675'
2012-07-16 16:12:56.918: [GIPCHTHR][1091303744] gipchaWorkerCreateInterface: created remote bootstrap multicast interface for node 'racnode2', haName 'CSS_racnode1_X45', inf 'mcast://230.0.1.0:42424/192.168.44.5:35675'
2012-07-16 16:12:56.918: [GIPCHTHR][1091303744] gipchaWorkerCreateInterface: created remote bootstrap broadcast interface for node 'racnode2', haName 'CSS_racnode1_X45', inf 'udp://192.168.44.255:35675'
..
2012-07-16 16:13:21.870: [    CSSD][1117145408]clssnmPollingThread: node racnode2 (2) at 90% heartbeat fatal, removal in 2.170 seconds, seedhbimpd 1
2012-07-16 16:13:21.870: [    CSSD][1109260608]clssnmvDHBValidateNCopy: node 2, racnode2, has a disk HB, but no network HB, DHB has rcfg 237506449, wrtcnt, 1203540, LATS 416476174, lastSeqNo 1203534, uniqueness 1342447749, timestamp 1342448001/4294500870
2012-07-16 16:13:22.294: [GIPCHGEN][1091303744] gipchaInterfaceDisable: disabling interface 0x2aaab4081d60 { host 'racnode2', haName 'CSS_racnode1_X45', local 0xbd4bd50, ip '192.168.44.5:35675', subnet '192.168.44.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 3, flags 0x18e6 }
2012-07-16 16:13:22.873: [    CSSD][1097062720]clssnmvDHBValidateNCopy: node 2, racnode2, has a disk HB, but no network HB, DHB has rcfg 237506449, wrtcnt, 1203545, LATS 416477184, lastSeqNo 1202791, uniqueness 1342447749, timestamp 1342448002/4294501710
2012-07-16 16:13:22.873: [    CSSD][1109260608]clssnmvDHBValidateNCopy: node 2, racnode2, has a disk HB, but no network HB, DHB has rcfg 237506449, wrtcnt, 1203546, LATS 416477184, lastSeqNo 1203540, uniqueness 1342447749, timestamp 1342448002/4294501870
..
2012-07-16 16:13:24.038: [    CSSD][1120299328](:CSSNM00007:)clssnmrRemoveNode: Evicting node 2, racnode2, from the cluster in incarnation 237506449, node birth incarnation 237506448, death incarnation 237506449, stateflags 0x224000 uniqueness value 1342447749
..
2012-07-16 16:13:24.038: [    CSSD][1097062720]clssnmvDiskEvict: Kill block write, file /dev/mpath/360060160ee4028006c1fa21d7ecbe111 flags 0x00010004, kill block unique 1342447749, stamp 416478344/416478344
..
2012-07-16 16:13:24.345: [    CSSD][1120299328]clssnmWaitOnEvictions: node 2, undead 1, EXADATA fence handle 0 kill reqest id 0, last DHB (1342448003, 4294503090, 234), seedhbimpd TRUE
2012-07-16 16:13:24.400: [    CSSD][1099016512]clssnmvDiskKillCheck: not evicted, file /dev/iscsi/iqn.2006-01.com.openfiler:blnhp1102-vm31.asm-crs.disk1.fg1.box3 flags 0x00000000, kill block unique 0, my unique 1342439461
..
2012-07-16 16:13:48.019: [GIPCHGEN][1091303744] gipchaInterfaceDisable: disabling interface 0x2aaab4081d60 { host 'racnode2', haName 'CSS_racnode1_X45', local 0xbd4bd50, ip '192.168.44.5:35675', subnet '192.168.44.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 3, flags 0x18e6 }
2012-07-16 16:13:48.022: [    CSSD][1120299328]clssnmWaitOnEvictions: node 2, undead 1, EXADATA fence handle 0 kill reqest id 0, last DHB (1342448004, 4294503340, 235), seedhbimpd TRUE
2012-07-16 16:13:48.022: [    CSSD][1120299328]clssnmWaitOnEviction: node(2) exceeded graceful shutdown period, IPMI-kill allowed if configured
2012-07-16 16:13:48.022: [    CSSD][1120299328]clssnmWaitOnEviction: Node kill could not beperformed. Admin or connection validation failed
2012-07-16 16:13:48.050: [GIPCHGEN][1091303744] gipchaInterfaceDisable: disabling interface 0x2aaab4081d60 { host 'racnode2', haName 'CSS_racnode1_X45', local 0xbd4bd50, ip '192.168.44.5:35675', subnet '192.168.44.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 3, flags 0x18e6 }
2012-07-16 16:13:48.085: [    CSSD][1120299328]clssnmWaitOnEvictions: node 2, undead 1, EXADATA fence handle 0 kill reqest id 0, last DHB (1342448004, 4294503340, 235), seedhbimpd TRUE
..
2012-07-16 16:13:55.041: [    CSSD][1120299328]clssnmWaitOnEvictions: node 2, undead 1, EXADATA fence handle 0 kill reqest id 0, last DHB (1342448004, 4294503340, 235), seedhbimpd TRUE
2012-07-16 16:13:55.041: [    CSSD][1120299328]clssnmCheckKillStatus: Node 2, racnode2, down, LATS(416479344),timeout(30000)
..
##>> node2 evicted as expected after network failure
2012-07-16 16:22:51.288: [GIPCHALO][1091303744] gipchaLowerProcessNode: no valid interfaces found to node for 595960 ms, node 0x2aaaac7da520 { host 'racnode2', haName 'CSS_racnode1_X45', srcLuid bc1c4e13-f18e423e, dstLuid 18dd3893-1159b293 numInf 1, contigSeq 678, lastAck 648, lastValidAck 678, sendSeq [1018 : 1018], createTime 416247444, sentRegister 1, localMonitor 1, flags 0x2408 }
..
2012-07-16 16:22:55.290: [GIPCHALO][1091303744] gipchaLowerProcessNode: bootstrap node considered dead because of idle connection time 600960 ms, node 0x2aaaac7da520 { host 'racnode2', haName 'CSS_racnode1_X45', srcLuid bc1c4e13-f18e423e, dstLuid 18dd3893-1159b293 numInf 1, contigSeq 678, lastAck 648, lastValidAck 678, sendSeq [1020 : 1020], createTime 416247444, sentRegister 1, localMonitor 1, flags 0x2408 }
##>> even network is restored more than 5 minutes earlier, gipc layer didn't recover it properly 

  • <GI_HOME>/log/<node>/gipcd/gipcd.log - node1
2012-07-16 16:12:41.233: [ CLSINET][1106753856] Returning NETDATA: 1 interfaces
2012-07-16 16:12:41.233: [ CLSINET][1106753856] # 0 Interface 'bond0',ip='192.168.44.5',mac='00-26-55-52-75-32',mask='255.255.255.0',net='192.168.44.0',use='cluster_interconnect'
2012-07-16 16:12:41.243: [GIPCDMON][1106753856] gipcdMonitorSaveInfMetrics: inf[ 0]  bond0                - rank   99, avgms 0.561798 [ 351 / 356 / 356 ]
..
2012-07-16 16:12:56.233: [ CLSINET][1106753856] Returning NETDATA: 0 interfaces
2012-07-16 16:12:56.233: [GIPCDMON][1106753856] gipcdMonitorUpdate: interface went down - [ ip 192.168.44.5, subnet 192.168.44.0, mask 255.255.255.0, mac 00-26-55-52-75-32, ifname bond0 ]
2012-07-16 16:12:56.233: [GIPCDMON][1106753856] gipcdMonitorUpdate: msg sent to client thread (([update(ip: 192.168.44.5, mask: 255.255.255.0, subnet: 192.168.44.0, mac: 00-26-55-52-75-32, ifname: bond0), state(gipcdadapterstateDown)]))
2012-07-16 16:12:56.233: [GIPCDNDE][1106753856] gipcdEnqueueMsgForClient: Enqueuing a msg for client : req:[hostname(), id (0000000000000000, 0000000000000000), len(632), req cookie(0), type(gipcdReqTypeMonitorUpdate)]
2012-07-16 16:12:56.233: [GIPCDCLT][1086585152] gipcdProcessNodeRequest: Dequeued req from host , type: gipcdReqTypeMonitorUpdate, id: 0000000000000000, 0000000000000000
2012-07-16 16:12:56.233: [GIPCDCLT][1086585152] gipcdRawInterfaceUpdates: ([update(ip: 192.168.44.5, mask: 255.255.255.0, subnet: 192.168.44.0, mac: 00-26-55-52-75-32, ifname: bond0), state(gipcdadapterstateDown)])
2012-07-16 16:12:56.233: [GIPCHGEN][1108855104] gipchaInterfaceFail: marking interface failing 0x2aaaac3210f0 { host '', haName 'gipcd_ha_name', local (nil), ip '192.168.44.5:25216', subnet '192.168.44.0', mask '255.255.255.0', mac '00-26-55-52-75-32', ifname 'bond0', numRef 1, numFail 0, idxBoot 0, flags 0x184d }
2012-07-16 16:12:56.233: [GIPCDCLT][1086585152] gipcdClientThread: Sent req to local client on endp 0000000000000366:(type(gipcdmsgtypeInterfaceUpdate), endp(0000000000000366), len(1032), buf(0x2aaab001e270):[hostname(racnode1), req cookie(0), haname(),update(ip: 192.168.44.5, mask: 255.255.255.0, subnet: 192.168.44.0, mac: 00-26-55-52-75-32, ifname: bond0), state(gipcdadapterstateDown), retStatus(gipcretSuccess)])
..
2012-07-16 16:13:43.204: [GIPCHALO][1088686400] gipchaLowerProcessNode: no valid interfaces found to node for 47210 ms, node 0x2aaaac2a3e60 { host 'racnode2', haName 'gipcd_ha_name', srcLuid c2d0fd44-d5ea5339, dstLuid a23f2c9c-f4f6d438 numInf 0, contigSeq 23, lastAck 24, lastValidAck 23, sendSeq [25 : 28], createTime 416251644, sentRegister 1, localMonitor 1, flags 0x2008 }
2012-07-16 16:13:46.248: [ CLSINET][1106753856] Returning NETDATA: 0 interfaces
..
2012-07-16 16:15:01.279: [ CLSINET][1106753856] Returning NETDATA: 1 interfaces
2012-07-16 16:15:01.279: [ CLSINET][1106753856] # 0 Interface 'bond0',ip='192.168.44.5',mac='00-26-55-52-75-32',mask='255.255.255.0',net='192.168.44.0',use='cluster_interconnect'
2012-07-16 16:15:01.279: [GIPCDMON][1106753856] gipcdMonitorUpdate: New Interface found - [ ip 192.168.44.5, subnet 192.168.44.0, mask 255.255.255.0, mac 00-26-55-52-75-32, ifname bond0]
2012-07-16 16:15:01.279: [GIPCDMON][1106753856] gipcdMonitorUpdate: msg sent to client thread (([update(ip: 192.168.44.5, mask: 255.255.255.0, subnet: 192.168.44.0, mac: 00-26-55-52-75-32, ifname: bond0), state(gipcdadapterstateUp)]))
2012-07-16 16:15:01.279: [GIPCDNDE][1106753856] gipcdEnqueueMsgForClient: Enqueuing a msg for client : req:[hostname(), id (0000000000000000, 0000000000000000), len(632), req cookie(0), type(gipcdReqTypeMonitorUpdate)]
2012-07-16 16:15:01.279: [GIPCDCLT][1086585152] gipcdProcessNodeRequest: Dequeued req from host , type: gipcdReqTypeMonitorUpdate, id: 0000000000000000, 0000000000000000
2012-07-16 16:15:01.279: [GIPCDCLT][1086585152] gipcdRawInterfaceUpdates: ([update(ip: 192.168.44.5, mask: 255.255.255.0, subnet: 192.168.44.0, mac: 00-26-55-52-75-32, ifname: bond0), state(gipcdadapterstateUp)])
2012-07-16 16:15:01.280: [GIPCHGEN][1108855104] gipchaNodeAddInterface: adding interface information for inf 0x2aaaac220350 { host '', haName 'gipcd_ha_name', local (nil), ip '192.168.44.5', subnet '192.168.44.0', mask '255.255.255.0', mac '00-26-55-52-75-32', ifname 'bond0', numRef 0, numFail 0, idxBoot 0, flags 0x1841 }
2012-07-16 16:15:01.280: [GIPCDCLT][1086585152] gipcdClientThread: Sent req to local client on endp 0000000000000366:(type(gipcdmsgtypeInterfaceUpdate), endp(0000000000000366), len(1032), buf(0x2aaab0000e60):[hostname(racnode1), req cookie(0), haname(),update(ip: 192.168.44.5, mask: 255.255.255.0, subnet: 192.168.44.0, mac: 00-26-55-52-75-32, ifname: bond0), state(gipcdadapterstateUp), retStatus(gipcretSuccess)])
..
2012-07-16 16:15:01.290: [GIPCDCLT][1086585152] gipcdClientThread: req from local client of type gipcdmsgtypeInterfacePublish, endp 0000000000000902
2012-07-16 16:15:01.290: [GIPCDCLT][1086585152] gipcdClientInterfacePublish: Received type(gipcdmsgtypeInterfacePublish), endp(0000000000000902), len(1032), buf(0x2aaab01efd28):[hostname(racnode1), req cookie(2891985472), haname(a523-7ff1-62c0-1781),update(ip: 192.168.44.5:21796, mask: 255.255.255.0, subnet: 192.168.44.0, mac: , ifname: ), state(gipcdadapterstateDown), retStatus(gipcretSuccess)]
2012-07-16 16:15:01.290: [GIPCDCLT][1086585152] gipcdClientInterfacePublish: Interface added ip: 192.168.44.5:21796, mask: 255.255.255.0, subnet: 192.168.44.0, mac: , ifname:
2012-07-16 16:15:01.290: [GIPCDCLT][1086585152] gipcdClientThread: Sent req to local client on endp 0000000000000902:(type(gipcdmsgtypeInterfacePublishAck), endp(0000000000000902), len(1032), buf(0x2aaab01eb690):[hostname(), req cookie(2891985472), haname(),update(), state(gipcdadapterstateDown), retStatus(gipcretSuccess)])
..
2012-07-16 16:15:02.262: [GIPCHTHR][1088686400] gipchaWorkerCreateInterface: created local interface for node 'racnode1', haName 'gipcd_ha_name', inf 'udp://192.168.44.5:32823'
2012-07-16 16:15:02.262: [GIPCHTHR][1088686400] gipchaWorkerCreateInterface: created local bootstrap multicast interface for node 'racnode1', haName 'gipcd_ha_name', inf 'mcast://224.0.0.251:42424/192.168.44.5'
2012-07-16 16:15:02.263: [GIPCHTHR][1088686400] gipchaWorkerCreateInterface: created local bootstrap multicast interface for node 'racnode1', haName 'gipcd_ha_name', inf 'mcast://230.0.1.0:42424/192.168.44.5'
2012-07-16 16:15:02.263: [GIPCHTHR][1088686400] gipchaWorkerCreateInterface: created local bootstrap broadcast interface for node 'racnode1', haName 'gipcd_ha_name', inf 'udp://192.168.44.255:42424'
2012-07-16 16:15:02.263: [GIPCHGEN][1088686400] gipchaNodeAddInterface: adding interface information for inf 0x2aaaac3210f0 { host 'racnode2', haName 'gipcd_ha_name', local (nil), ip '192.168.44.5:32823', subnet '192.168.44.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 0, flags 0x1842 }
2012-07-16 16:15:02.263: [GIPCHTHR][1088686400] gipchaWorkerCreateInterface: created remote bootstrap multicast interface for node 'racnode2', haName 'gipcd_ha_name', inf 'mcast://224.0.0.251:42424/192.168.44.5:32823'
2012-07-16 16:15:02.263: [GIPCHTHR][1088686400] gipchaWorkerCreateInterface: created remote bootstrap multicast interface for node 'racnode2', haName 'gipcd_ha_name', inf 'mcast://230.0.1.0:42424/192.168.44.5:32823'
2012-07-16 16:15:02.263: [GIPCHTHR][1088686400] gipchaWorkerCreateInterface: created remote bootstrap broadcast interface for node 'racnode2', haName 'gipcd_ha_name', inf 'udp://192.168.44.255:32823'
2012-07-16 16:15:02.263: [GIPCHGEN][1088686400] gipchaWorkerAttachInterface: Interface attached inf 0x2aaaac3210f0 { host 'racnode2', haName 'gipcd_ha_name', local 0x2aaaac220350, ip '192.168.44.5:32823', subnet '192.168.44.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 0, flags 0x1846 }
..
2012-07-16 16:15:06.284: [ CLSINET][1106753856] Returning NETDATA: 1 interfaces
2012-07-16 16:15:06.284: [ CLSINET][1106753856] # 0 Interface 'bond0',ip='192.168.44.5',mac='00-26-55-52-75-32',mask='255.255.255.0',net='192.168.44.0',use='cluster_interconnect'
..
2012-07-16 16:15:11.267: [GIPCDMON][1106753856] gipcdMonitorSaveInfMetrics: inf[ 0]  bond0                - rank    0, avgms 30000000000.000000 [ 4 / 0 / 0 ]        ##>> rank stayed 0 after network is restored
..
2012-07-16 16:23:11.363: [GIPCDMON][1106753856] gipcdMonitorSaveInfMetrics: inf[ 0]  bond0                - rank    0, avgms 30000000000.000000 [ 19 / 0 / 0 ]
2012-07-16 16:23:11.463: [ CLSINET][1106753856] Returning NETDATA: 1 interfaces
2012-07-16 16:23:11.463: [ CLSINET][1106753856] # 0 Interface 'bond0',ip='192.168.44.5',mac='00-26-55-52-75-32',mask='255.255.255.0',net='192.168.44.0',use='cluster_interconnect'
..
2012-07-16 16:23:41.378: [GIPCDMON][1106753856] gipcdMonitorSaveInfMetrics: inf[ 0]  bond0                - rank   -1, avgms 30000000000.000000 [ 0 / 0 / 0 ]        ##>> rank changed to "-1" 10 minutes after the failure although it was restored a few minutes earlier
2012-07-16 16:23:41.378: [GIPCDMON][1106753856] gipcdMonitorSaveInfMetrics: saving: 
2012-07-16 16:23:41.474: [ CLSINET][1106753856] Returning NETDATA: 1 interfaces
2012-07-16 16:23:41.474: [ CLSINET][1106753856] # 0 Interface 'bond0',ip='192.168.44.5',mac='00-26-55-52-75-32',mask='255.255.255.0',net='192.168.44.0',use='cluster_interconnect'

  • <GI_HOME>/log/<node>/cssd/ocssd.log - node2
2012-07-16 16:13:10.052: [    CSSD][1111189824]clssnmPollingThread: node racnode1 (1) at 50% heartbeat fatal, removal in 14.180 seconds
2012-07-16 16:13:10.052: [    CSSD][1111189824]clssnmPollingThread: node racnode1 (1) is impending reconfig, flag 2491406, misstime 15820
2012-07-16 16:13:10.052: [    CSSD][1111189824]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2012-07-16 16:13:10.052: [    CSSD][1090488640]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 237506449, wrtcnt, 1214131, LATS 4294489160, lastSeqNo 0, uniqueness 1342439461, timestamp 1342447989/416463944
..
2012-07-16 16:13:24.243: [    CSSD][1087187264](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node racnode1, number 1, sync 237506449, stamp 416478344     ##>> node got evicted after network failure
2012-07-16 16:13:24.243: [    CSSD][1115920704]clssnmQueueClientEvent: Node[1] state = 5, birth = 237506444, unique = 1342439461
2012-07-16 16:13:24.243: [    CSSD][1087187264]###################################
2012-07-16 16:13:24.243: [    CSSD][1115920704]clssnmQueueClientEvent: Node[2] state = 3, birth = 237506448, unique = 1342447749
2012-07-16 16:13:24.243: [    CSSD][1102309696](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node racnode1, number 1, sync 237506449, stamp 416478344
2012-07-16 16:13:24.243: [    CSSD][1087187264]clssscExit: CSSD aborting from thread clssnmvKillBlockThread
..
2012-07-16 16:19:43.738: [    CSSD][1806091136]clssscmain: Starting CSS daemon, version 11.2.0.3.0, in (clustered) mode with uniqueness value 1342448383        ##>> cssd auto start upon OS reboot
..
2012-07-16 16:19:48.708: [    CSSD][1077565760]clssnmCompleteVFDiscovery: Committed configuration for CIN 0:1342040761:0
2012-07-16 16:19:48.708: [    CSSD][1077565760]  misscount          30    reboot latency      3
2012-07-16 16:19:48.708: [    CSSD][1077565760]  long I/O timeout  200    short I/O timeout  27
2012-07-16 16:19:48.708: [    CSSD][1077565760]  diagnostic wait     0  active version 11.2.0.3.0
2012-07-16 16:19:48.708: [    CSSD][1077565760]  Listing unique IDs for 3 voting files:
2012-07-16 16:19:48.708: [    CSSD][1077565760]    voting file 1: 476c9740-71d74f55-bfc6991c-cae65a5b
2012-07-16 16:19:48.708: [    CSSD][1077565760]    voting file 2: 0716c636-607e4f6a-bfcf0d1b-8ef14814
2012-07-16 16:19:48.708: [    CSSD][1077565760]    voting file 3: 9fc2c40b-41674fdf-bf3295f9-ef04d8d9
2012-07-16 16:19:48.708: [    CSSD][1806091136]clssnmOpenGIPCEndp: opening cluster listener on gipc://racnode2:nm_racnode1_X45
..
2012-07-16 16:19:49.217: [ CLSINET][1092094272] Returning NETDATA: 1 interfaces
2012-07-16 16:19:49.217: [ CLSINET][1092094272] # 0 Interface 'bond0',ip='192.168.44.10',mac='18-a9-05-4c-6f-1a',mask='255.255.255.0',net='192.168.44.0',use='cluster_interconnect'
2012-07-16 16:19:49.217: [GIPCHGEN][1092094272] gipchaNodeAddInterface: adding interface information for inf 0x1d6e58f0 { host '', haName 'CSS_racnode1_X45', local (nil), ip '192.168.44.10', subnet '192.168.44.0', mask '255.255.255.0', mac '18-a9-05-4c-6f-1a', ifname 'bond0', numRef 0, numFail 0, idxBoot 0, flags 0x1841 }
2012-07-16 16:19:49.217: [GIPCHTHR][1090517312] gipchaWorkerCreateInterface: created local interface for node 'racnode2', haName 'CSS_racnode1_X45', inf 'udp://192.168.44.10:37726'
2012-07-16 16:19:49.217: [GIPCHTHR][1090517312] gipchaWorkerCreateInterface: created local bootstrap multicast interface for node 'racnode2', haName 'CSS_racnode1_X45', inf 'mcast://224.0.0.251:42424/192.168.44.10'
2012-07-16 16:19:49.218: [GIPCHTHR][1090517312] gipchaWorkerCreateInterface: created local bootstrap multicast interface for node 'racnode2', haName 'CSS_racnode1_X45', inf 'mcast://230.0.1.0:42424/192.168.44.10'
2012-07-16 16:19:49.218: [GIPCHTHR][1090517312] gipchaWorkerCreateInterface: created local bootstrap broadcast interface for node 'racnode2', haName 'CSS_racnode1_X45', inf 'udp://192.168.44.255:42424'
2012-07-16 16:19:49.218: [    CSSD][1806091136]clssnmOpenGIPCEndp: listening on gipcha://racnode2:nm2_racnode1_X45
..
2012-07-16 16:29:37.835: [    CSSD][1108764992]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 237506450, wrtcnt, 1217177, LATS 4294841470, lastSeqNo 1217167, uniqueness 1342439461, timestamp 1342448976/417450424
2012-07-16 16:29:38.039: [    CSSD][1103366464]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 237506450, wrtcnt, 1217178, LATS 4294841680, lastSeqNo 1217176, uniqueness 1342439461, timestamp 1342448977/417451294
2012-07-16 16:29:38.194: [    CSSD][1075505472]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 237506450, wrtcnt, 1217179, LATS 4294841830, lastSeqNo 1217169, uniqueness 1342439461, timestamp 1342448977/417451314      ##>> message repeated for about 10 minutes
..
2012-07-16 16:29:41.806: [    CSSD][1116649792]clssnmRcfgMgrThread: Local Join
2012-07-16 16:29:41.806: [    CSSD][1116649792]clssnmLocalJoinEvent: begin on node(2), waittime 193000
2012-07-16 16:29:41.806: [    CSSD][1116649792]clssnmLocalJoinEvent: set curtime (4294845440) for my node
2012-07-16 16:29:41.806: [    CSSD][1116649792]clssnmLocalJoinEvent: scanning 32 nodes
2012-07-16 16:29:41.806: [    CSSD][1116649792]clssnmLocalJoinEvent: Node racnode1, number 1, is in an existing cluster with disk state 3
2012-07-16 16:29:41.806: [    CSSD][1116649792]clssnmLocalJoinEvent: takeover aborted due to cluster member node found on disk                     ##>> message repeated for about 10 minutes
..
2012-07-16 16:29:42.783: [    CSSD][1085380928]clssgmExecuteClientRequest: MAINT recvd from proc 2 (0x1d4527f0)
2012-07-16 16:29:42.783: [    CSSD][1085380928]clssgmShutDown: Received abortive shutdown request from client.
2012-07-16 16:29:42.783: [    CSSD][1085380928]###################################
2012-07-16 16:29:42.783: [    CSSD][1085380928]clssscExit: CSSD aborting from thread GMClientListener
2012-07-16 16:29:42.783: [    CSSD][1085380928]###################################
2012-07-16 16:29:42.783: [    CSSD][1085380928](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
..
2012-07-16 16:29:49.592: [    CSSD][2254918528]clssscmain: Starting CSS daemon, version 11.2.0.3.0, in (clustered) mode with uniqueness value 1342448989

  • <GI_HOME>/log/<node>/gipcd/gipcd.log - node2
..
2012-07-16 16:19:53.621: [ CLSINET][1108175168] Returning NETDATA: 1 interfaces
2012-07-16 16:19:53.621: [ CLSINET][1108175168] # 0 Interface 'bond0',ip='192.168.44.10',mac='18-a9-05-4c-6f-1a',mask='255.255.255.0',net='192.168.44.0',use='cluster_interconnect'
2012-07-16 16:19:53.621: [GIPCDMON][1108175168] gipcdMonitorUpdate: New Interface found - [ ip 192.168.44.10, subnet 192.168.44.0, mask 255.255.255.0, mac 18-a9-05-4c-6f-1a, ifname bond0]
2012-07-16 16:19:53.621: [GIPCDMON][1108175168] gipcdMonitorUpdate: msg sent to client thread (([update(ip: 192.168.44.10, mask: 255.255.255.0, subnet: 192.168.44.0, mac: 18-a9-05-4c-6f-1a, ifname: bond0), state(gipcdadapterstateUp)]))
2012-07-16 16:19:53.621: [GIPCDNDE][1108175168] gipcdEnqueueMsgForClient: Enqueuing a msg for client : req:[hostname(), id (0000000000000000, 0000000000000000), len(632), req cookie(0), type(gipcdReqTypeMonitorUpdate)]
2012-07-16 16:19:53.621: [GIPCDMON][1108175168] gipcdMonitorSaveInfMetrics: inf[ 0]  bond0                - rank   -1, avgms 30000000000.000000 [ 0 / 0 / 0 ]
2012-07-16 16:19:53.621: [GIPCDMON][1108175168] gipcdMonitorSaveInfMetrics: saving: 
..
2012-07-16 16:20:04.600: [GIPCDMON][1108175168] gipcdMonitorCssInit: Successfully connected to CSS
2012-07-16 16:20:08.593: [ CLSINET][1108175168] Returning NETDATA: 1 interfaces
2012-07-16 16:20:08.593: [ CLSINET][1108175168] # 0 Interface 'bond0',ip='192.168.44.10',mac='18-a9-05-4c-6f-1a',mask='255.255.255.0',net='192.168.44.0',use='cluster_interconnect'
..
2012-07-16 16:20:23.632: [GIPCDMON][1108175168] gipcdMonitorSaveInfMetrics: inf[ 0]  bond0                - rank    0, avgms 30000000000.000000 [ 15 / 0 / 0 ]
2012-07-16 16:20:23.632: [GIPCDMON][1108175168] gipcdMonitorSaveInfMetrics: saving:  bond0:0
  
  • OS messages could be different depending on platform     
hp-ux 
Aug 22 19:48:12 mosr4d1 vmunix: APA/LM: FOG:lan900 is down
Aug 22 19:48:12 mosr4d1 cmnetd[3940]: lan900 is down at the data link layer.
Aug 22 19:48:12 mosr4d1 cmnetd[3940]: lan900 failed.
Aug 22 19:48:12 mosr4d1 cmnetd[3940]: Subnet 192.57.253.0 down
 

AIX

LABEL:  ECH_CHAN_RCVRY
..
Date/Time:       Wed Aug  8 10:57:05 2012
..
Description
ETHERCHANNEL RECOVERY
..
Detail Data
A primary adapter in the EtherChannel recovered: returning to main channel
---------------------------------------------------------------------------
LABEL:  GOENT_RCVRY_EXIT
..
Date/Time:       Wed Aug  8 10:57:05 2012
..
Description
ETHERNET NETWORK RECOVERY MODE

---------------------------------------------------------------------------
LABEL:  ECH_TOTAL_CHAN_FAIL
..
Date/Time:       Wed Aug  8 10:30:26 2012
..
Description
TOTAL ETHERCHANNEL FAILURE
..
Detail Data
Whole EtherChannel is down: backup adapter failed while operating on backup
---------------------------------------------------------------------------
LABEL:  GOENT_LINK_DOWN
..
Date/Time:       Wed Aug  8 10:30:26 2012
..
Description
ETHERNET DOWN
---------------------------------------------------------------------------
LABEL:  ECH_CHAN_FAIL
..
Date/Time:       Wed Aug  8 10:30:23 2012
..
Description
ETHERCHANNEL FAILOVER
..
Detail Data
All primary EtherChannel adapters failed: switching over to backup adapter


 
 
  

CAUSE

Refer to Solution section 


SOLUTION


#1: Due to bug 13653178 - Node cannot join the cluster after reboot or "interconnect restored"

The bug is fixed in 11.2.0.4 
It's recommended to apply latest patch set and patch set update; for PSU5 environment where PSU6 or higher can't be applied, patch 13899736 must be applied, refer to note 1500862.1 for details.


#2: Due to bug 13899736 - Node cannot join the cluster after reboot or "interconnect restored"

The bug is fixed in 11.2.0.3 GI PSU6, 11.2.0.4, 12.1.0.2 onward, it's recommended to apply latest patch set and patch set update.

The workaround for both #1 and #2 is:

Step1. On surviving node, during non-peak time if possible, kill gipcd.bin process (kill -15 <gipcd.bin ospid>)
NOTE:  In 11.2 will also lead to death of evmd.bin, crsd.bin and ctssd.bin processes. None of these processes are fatal. Clusterware will respawn all them automatically.
Step2. Once gipcd.bin, evmd.bin, crsd.bin and ctssd.bin processes have been re-spawned on the surviving node, verify whether other nodes join cluster.
Step3. Most of the time, GI will start, but in case it does not, re-start GI on the other nodes with crsctl command.

Step4. Finally, if GI is still not starting, as a last resort, restart GI on the surviving node.

Niciun comentariu:

Trimiteți un comentariu