Last night one of my five cluster nodes suffered a hardware failure (memory or CPU?). The other nodes properly fenced the failed machine, but no matter what clusvcadm command I ran, I could not get the other cluster members to start, stop or disable the cluster resource group/service that had been running on the failed node. (That resource group/service includes an ext3 filesystem, an IP address, and rsyncd and smbd init scripts.)
The "clusvcadm -d [service]" command would just hang for minutes and not return. "clustat" initially reported the rg/service in an unknown state, then stopped reporting rgmanager status altogether and showed only cman status. The cluster remained quorate the entire time. Resource groups/services on the non-failed nodes continued to run, but no matter what I tried I could not get rgmanager status on any node.
I had to reset the entire cluster to get things back to normal. (This is a heavily used operational system, so I didn't have time to do further debugging.) My logs don't show any rgmanager-related error messages, only fencing status:
Nov 6 20:24:37 bamf02 kernel: CMAN: removing node bamf03 from the cluster : Missed too many heartbeats
Nov 6 20:24:38 bamf02 fenced[5913]: fencing deferred to bamf01
---
Nov 6 20:24:37 bamf01 kernel: CMAN: node bamf03 has been removed from the cluster : Missed too many heartbeats
Nov 6 20:24:38 bamf01 fenced[5756]: bamf03 not a cluster member after 0 sec post_fail_delay
Nov 6 20:24:38 bamf01 fenced[5756]: fencing node "bamf03"
Nov 6 20:24:46 bamf01 fenced[5756]: fence "bamf03" success
Nov 6 20:30:36 bamf01 sshd(pam_unix)[27244]: session opened for user root by root(uid=0)
Nov 6 20:36:29 bamf01 kernel: CMAN: node bamf03 rejoining
Nov 6 20:42:55 bamf01 shutdown: shutting down for system reboot
---
I'm running RHEL4U4 (cman 1.0.11-0, cman-kernel-smp 2.6.9-45.5, dlm 1.0.1-1, magma 1.0.6-0, rgmanager 1.9.53) on x86_64 hardware.
Nov 6 20:17:48 bamf03 clurgmgrd: [4170]: <info> Executing /etc/init.d/rsyncd-cougar status
Nov 6 20:17:51 bamf03 sshd(pam_unix)[10896]: session opened for user root by (uid=0)
Nov 6 20:18:18 bamf03 clurgmgrd: [4170]: <info> Executing /etc/init.d/rsyncd-cougar status
Nov 6 20:19:18 bamf03 last message repeated 2 times
Nov 6 20:20:48 bamf03 last message repeated 3 times
Nov 6 20:21:18 bamf03 clurgmgrd: [4170]: <info> Executing /etc/init.d/rsyncd-cougar status
Nov 6 20:21:34 bamf03 kernel: Bad page state at prep_new_page (in process 'smbd', page 00000101fe80fec0)
Nov 6 20:21:34 bamf03 kernel: flags:0x05001078 mapping:000001010c7f75e8 mapcount:0 count:2
Nov 6 20:21:34 bamf03 kernel: Backtrace:
Nov 6 20:21:34 bamf03 kernel:
Nov 6 20:21:34 bamf03 kernel: Call Trace:<ffffffff8015d383>{bad_page+112} <ffffffff8015dd41>{buffered_rmqueue+520}
Nov 6 20:21:34 bamf03 kernel: <ffffffff802a721f>{sock_sendmsg+271} <ffffffff8015de7f>{__alloc_pages+211}
Nov 6 20:21:34 bamf03 kernel: <ffffffff8015e145>{__get_free_pages+11} <ffffffff8018b3d7>{__pollwait+58}
Nov 6 20:21:34 bamf03 kernel: <ffffffff802ad4af>{datagram_poll+39} <ffffffff802ad488>{datagram_poll+0}
Nov 6 20:21:34 bamf03 kernel: <ffffffff802ad488>{datagram_poll+0} <ffffffff8018b6e8>{do_select+656}
Nov 6 20:21:34 bamf03 kernel: <ffffffff8018b39d>{__pollwait+0} <ffffffff8018bb82>{sys_select+820}
Nov 6 20:21:34 bamf03 kernel: <ffffffff801932d8>{dnotify_parent+34} <ffffffff8011026a>{system_call+126}
Nov 6 20:21:34 bamf03 kernel:
Nov 6 20:21:34 bamf03 kernel: Trying to fix it up, but a reboot is needed
Nov 6 20:21:48 bamf03 clurgmgrd: [4170]: <info> Executing /etc/init.d/rsyncd-cougar status
Nov 6 20:22:08 bamf03 kernel: Bad page state at prep_new_page (in process 'ip.sh', page 00000101fe80c730)
Nov 6 20:22:08 bamf03 kernel: flags:0x0500102c mapping:0000010079d9a3e0 mapcount:0 count:2
Nov 6 20:22:08 bamf03 kernel: Backtrace:
Nov 6 20:22:08 bamf03 kernel:
Nov 6 20:22:08 bamf03 kernel: Call Trace:<ffffffff8015d383>{bad_page+112} <ffffffff8015dd41>{buffered_rmqueue+520}
Nov 6 20:22:08 bamf03 kernel: <ffffffff8015de7f>{__alloc_pages+211} <ffffffff801696e6>{do_no_page+651}
Nov 6 20:22:08 bamf03 kernel: <ffffffff8015c5cb>{__generic_file_aio_read+385} <ffffffff80169ca7>{handle_mm_fault+373}
Nov 6 20:22:08 bamf03 kernel: <ffffffff8015c7af>{generic_file_aio_read+48} <ffffffff801793e8>{do_sync_read+173}
Nov 6 20:22:08 bamf03 kernel: <ffffffff8018f01d>{dput+56} <ffffffff80123e9a>{do_page_fault+518}
Nov 6 20:22:08 bamf03 kernel: <ffffffff80135756>{autoremove_wake_function+0} <ffffffff801932d8>{dnotify_parent+34}
Nov 6 20:22:08 bamf03 kernel: <ffffffff8017950c>{vfs_read+248} <ffffffff80110d91>{error_exit+0}
Nov 6 20:22:08 bamf03 kernel:
Nov 6 20:22:08 bamf03 kernel: Trying to fix it up, but a reboot is needed
Nov 6 20:22:16 bamf03 kernel: Bad page state at prep_new_page (in process 'smbd', page 00000101fe816ec0)
Nov 6 20:22:16 bamf03 kernel: flags:0x05001028 mapping:000001018b7eea30 mapcount:0 count:2
Nov 6 20:22:16 bamf03 kernel: Backtrace:
Nov 6 20:22:16 bamf03 kernel:
Nov 6 20:22:16 bamf03 kernel: Call Trace:<ffffffff8015d383>{bad_page+112} <ffffffff8015dd41>{buffered_rmqueue+520}
Nov 6 20:22:16 bamf03 kernel: <ffffffff802a721f>{sock_sendmsg+271} <ffffffff8015de7f>{__alloc_pages+211}
Nov 6 20:22:16 bamf03 kernel: <ffffffff8015e145>{__get_free_pages+11} <ffffffff8018b3d7>{__pollwait+58}
Nov 6 20:22:16 bamf03 kernel: <ffffffff802cff03>{tcp_poll+44} <ffffffff8018b6e8>{do_select+656}
Nov 6 20:22:16 bamf03 kernel: <ffffffff8018b39d>{__pollwait+0} <ffffffff8018bb82>{sys_select+820}
Nov 6 20:22:16 bamf03 kernel: <ffffffff801932d8>{dnotify_parent+34} <ffffffff8011026a>{system_call+126}
Nov 6 20:22:16 bamf03 kernel:
Nov 6 20:22:16 bamf03 kernel: Trying to fix it up, but a reboot is needed
Nov 6 20:22:18 bamf03 clurgmgrd: [4170]: <info> Executing /etc/init.d/rsyncd-cougar status
Nov 6 20:22:38 bamf03 clurgmgrd[4170]: <notice> Stopping service cougar-compout
Nov 6 20:22:38 bamf03 clurgmgrd: [4170]: <info> Executing /etc/init.d/rsyncd-cougar stop
Nov 6 20:22:38 bamf03 clurgmgrd: [4170]: <info> Removing IPv4 address 192.168.10.22 from bond0
Nov 6 20:22:41 bamf03 clurgmgrd: [4170]: <info> Stopping Samba instance "cougar"
Nov 6 20:22:41 bamf03 nmbd[30156]: [2006/11/06 20:22:41, 0] nmbd/nmbd.c:terminate(56)
Nov 6 20:22:41 bamf03 nmbd[30156]: Got SIGTERM: going down...
Nov 6 20:22:41 bamf03 nmbd[30156]: [2006/11/06 20:22:41, 0] libsmb/nmblib.c:send_udp(790)
Nov 6 20:22:41 bamf03 nmbd[30156]: Packet send failed to 192.168.255.255(138) ERRNO=Invalid argument
Nov 6 20:23:10 bamf03 sshd(pam_unix)[13090]: session opened for user root by root(uid=0)
Nov 6 20:24:16 bamf03 sshd(pam_unix)[13146]: session opened for user root by root(uid=0)
Nov 6 20:24:36 bamf03 kernel: CMAN: removing node bamf01 from the cluster : Missed too many heartbeats
Nov 6 20:24:38 bamf03 kernel: clustat[13184] trap stack segment rip:33512b1c13 rsp:7fbffff840 error:0
Nov 6 21:36:04 bamf03 syslogd 1.4.1: restart.