I ran a series of reboots, and this problem is totally reproducible. Should I open a ticket with Red Hat Support on this?

The problem is immediate with 'service rgmanager stop': it hangs in its sleep loop forever, even though all nodes in the cluster report that its state changed to down. Worse than that, it also hangs all GFS I/O, and the load average on all nodes starts to spike (>9.00) -- I can see gfs_scand racing away in top. It only gets fixed when I manually 'power reset' the node; then I get the 'Missed too many heartbeats' message followed by fencing. Help.

(A rough list of the diagnostics I plan to capture on the next occurrence, before resetting the node, follows the quoted message below.)

Robert Hurst, Sr. Caché Administrator
Beth Israel Deaconess Medical Center
1135 Tremont Street, REN-7
Boston, Massachusetts 02120-2140
617-754-8754 · Fax: 617-754-8730 · Cell: 401-787-3154

Any technology distinguishable from magic is insufficiently advanced.

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx on behalf of rhurst@xxxxxxxxxxxxxxxxx
Sent: Tue 3/20/2007 11:39 AM
To: linux-cluster@xxxxxxxxxx
Subject: GFS/CS blocks all I/O on 1 server reboot of 11 nodes?

Troubling: this behavior had not occurred prior to our Mar 2nd up2date on our RHEL GFS/CS subscription. I rebooted an application server (app1) in an 11-node cluster, and from viewing its console, it 'hung' on 'service cman stop'. Consequently, ALL GFS I/O got blocked on ALL nodes.

All servers are configured the same: AMD64 dual-CPU/dual-core HP DL385, 8GB RAM, dual HBA (PowerPath)

# uname -r
2.6.9-42.0.10.ELsmp

ccs-1.0.7-0
cman-1.0.11-0
dlm-1.0.1-1
fence-1.32.25-1
GFS-6.1.6-1
magma-1.0.6-0
magma-plugins-1.0.9-0
rgmanager-1.9.54-1

My central syslog server showed that all nodes registered the membership change, yet the service continued to hang:

Mar 20 11:06:18 app1 shutdown: shutting down for system reboot
Mar 20 11:06:18 app1 init: Switching to runlevel: 6
Mar 20 11:06:19 app1 rgmanager: [1873]: <notice> Shutting down Cluster Service Manager...
Mar 20 11:06:20 app1 clurgmgrd[11220]: <notice> Shutting down
Mar 20 11:06:20 net2 clurgmgrd[30893]: <info> State change: app1 DOWN
Mar 20 11:06:20 app3 clurgmgrd[11092]: <info> State change: app1 DOWN
Mar 20 11:06:20 db1 clurgmgrd[8351]: <info> State change: app1 DOWN
Mar 20 11:06:20 db3 clurgmgrd[8279]: <info> State change: app1 DOWN
Mar 20 11:06:20 db2 clurgmgrd[10875]: <info> State change: app1 DOWN
Mar 20 11:06:20 app6 clurgmgrd[10959]: <info> State change: app1 DOWN
Mar 20 11:06:20 app4 clurgmgrd[11146]: <info> State change: app1 DOWN
Mar 20 11:06:20 app2 clurgmgrd[10835]: <info> State change: app1 DOWN
Mar 20 11:06:20 app5 clurgmgrd[11198]: <info> State change: app1 DOWN
Mar 20 11:06:20 net1 clurgmgrd[12689]: <info> State change: app1 DOWN
Mar 20 11:12:26 net2 kernel: CMAN: node app1 has been removed from the cluster : Missed too many heartbeats
Mar 20 11:12:26 db2 kernel: CMAN: removing node app1 from the cluster : Missed too many heartbeats
Mar 20 11:12:26 db3 kernel: CMAN: node app1 has been removed from the cluster : Missed too many heartbeats
Mar 20 11:12:26 app4 kernel: CMAN: node app1 has been removed from the cluster : Missed too many heartbeats
Mar 20 11:12:26 app5 kernel: CMAN: node app1 has been removed from the cluster : Missed too many heartbeats
Mar 20 11:12:26 app6 kernel: CMAN: node app1 has been removed from the cluster : Missed too many heartbeats
Mar 20 11:12:26 net1 kernel: CMAN: node app1 has been removed from the cluster : Missed too many heartbeats
Mar 20 11:12:26 app3 kernel: CMAN: node app1 has been removed from the cluster : Missed too many heartbeats
Mar 20 11:12:26 db1 kernel: CMAN: node app1 has been removed from the cluster : Missed too many heartbeats
Mar 20 11:12:26 app2 kernel: CMAN: node app1 has been removed from the cluster : Missed too many heartbeats
Mar 20 11:12:32 net1 fenced[10510]: app1 not a cluster member after 0 sec post_fail_delay
Mar 20 11:12:32 net1 fenced[10510]: fencing node "app1"
Mar 20 11:13:42 net1 fenced[10510]: fence "app1" success

I issued a 'power reset' on its HP iLO management port to hardware-reboot the server around 11:12. That is when the net1 server attempted to fence app1, after it went missing. Here are net1's syslog entries for that event:

Mar 20 11:06:20 net1 clurgmgrd[12689]: <info> Magma Event: Membership Change
Mar 20 11:06:20 net1 clurgmgrd[12689]: <info> State change: app1 DOWN
Mar 20 11:12:26 net1 kernel: CMAN: node app1 has been removed from the cluster : Missed too many heartbeats
Mar 20 11:12:32 net1 fenced[10510]: app1 not a cluster member after 0 sec post_fail_delay
Mar 20 11:12:32 net1 fenced[10510]: fencing node "app1"
Mar 20 11:13:42 net1 fenced[10510]: fence "app1" success
Mar 20 11:15:45 net1 kernel: CMAN: node app1 rejoining
Mar 20 11:18:05 net1 clurgmgrd[12689]: <info> Magma Event: Membership Change
Mar 20 11:18:05 net1 clurgmgrd[12689]: <info> State change: app1 UP

Robert Hurst, Sr. Caché Administrator
Beth Israel Deaconess Medical Center
1135 Tremont Street, REN-7
Boston, Massachusetts 02120-2140
617-754-8754 · Fax: 617-754-8730 · Cell: 401-787-3154

Any technology distinguishable from magic is insufficiently advanced.
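As mentioned above, here is roughly what I plan to capture the next time the hang occurs, before falling back to a power reset. The command names are from memory against this RHEL4 / cman 1.0 / GFS 6.1 stack, and the mount point is only a placeholder, so treat this as a sketch rather than a verified procedure:

# clustat                          -- rgmanager's view of cluster members and services
# cman_tool status                 -- membership and quorum state as CMAN sees it
# cman_tool services               -- fence domain, DLM lock space and GFS mount group states
# cat /proc/cluster/services       -- the same service-group data straight from the kernel
# gfs_tool counters /gfs/mount     -- glock/lock counters on the hung mount (substitute the real mount point)
# gfs_tool lockdump /gfs/mount     -- full glock dump on that mount, if the counters look wedged
# echo t > /proc/sysrq-trigger     -- dump every task's kernel stack to the kernel log (needs kernel.sysrq
                                      enabled), to see where clurgmgrd and gfs_scand are actually blocked

The sysrq-t traces go through the kernel log, so they should end up on the central syslog server along with everything else.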
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster