Has anyone experienced the following error/hang/loop when attempting to stop rgmanager or cman on the last node of a two-node cluster?

    groupd[4909]: cpg_leave error retrying

Basic scenario (RHEL 5.7 with the latest errata for cman):

  1. Create a two-node cluster with qdisk and a higher totem token=70000.
  2. Start cman on both nodes; wait for qdisk to come online with a master determined.
  3. Stop cman on node1; wait for it to complete.
  4. Stop cman on node2; the "cpg_leave" error is seen in the logging output.

Observations:

  - The "service cman stop" command hangs at the "Stopping fencing" output.
  - If I cycle the openais service with "service openais restart", then "service cman stop" will complete (I need to manually stop the openais service afterwards).
  - When hung, the command "group_tool dump" hangs (any group_tool command hangs).
  - The hang is inconsistent, which, in my mind, implies a timing issue. Inconsistent meaning that every once in a while the shutdown will complete (maybe 20% of the time).
  - I have seen the issue when stopping both rgmanager and cman. The example below has been stripped down to show the hang with cman alone.
  - I have tested with varying lengths of time to wait before stopping the second node, with no difference (the hang still occurs periodically).
  - I have tested with the totem token and quorum_dev_poll settings commented out and still experienced the hang. (We use the longer timeouts to help survive network and SAN blips.)

I have dug through some of the source code. The message appears in group's cpg.c, in the function do_cpg_leave(), which calls the cpg_leave() function provided by the openais package. If I attach to the groupd process with gdb, I get the following stack; watching with strace, groupd is just in a looping state.
(gdb) where
#0  0x000000341409a510 in __nanosleep_nocancel () from /lib64/libc.so.6
#1  0x000000341409a364 in sleep () from /lib64/libc.so.6
#2  0x000000000040a410 in time ()
#3  0x000000000040bd09 in time ()
#4  0x000000000040e2cb in time ()
#5  0x000000000040ebe0 in time ()
#6  0x000000000040f394 in time ()
#7  0x000000341401d994 in __libc_start_main () from /lib64/libc.so.6
#8  0x00000000004018f9 in time ()
#9  0x00007fff04a671c8 in ?? ()
#10 0x0000000000000000 in ?? ()

(The repeated "time ()" frames look bogus to me -- most likely gdb resolving addresses against the nearest exported symbol in a stripped binary; installing the matching debuginfo packages should give a real backtrace.)

If I attach to the aisexec process with gdb, I see the following:

(gdb) where
#0  0x00000034140cb696 in poll () from /lib64/libc.so.6
#1  0x0000000000405c50 in poll_run ()
#2  0x0000000000418aae in main ()

As you can see in the cluster.conf example below, I have attempted many different ways to create more debug logging. I do see debug messages from openais's cpg.c component during startup, but nothing is logged in the shutdown hang scenario. I would appreciate any guidance on how to troubleshoot further, especially on increasing the tracing of the openais calls in cpg.c.
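For what it's worth, my reading of the do_cpg_leave() path is that groupd retries cpg_leave() in a loop with a sleep between attempts when the call keeps failing, which would match the __nanosleep_nocancel frame in the groupd stack above. A rough shell sketch of that retry pattern (the function name, retry cap, and messages here are mine, not from the actual source):

```shell
#!/bin/sh
# Sketch only: retry a "leave" call while it keeps failing, sleeping
# between attempts, the way groupd appears to loop on cpg_leave().
# The command passed in stands in for the real cpg_leave() call;
# the cap of 10 attempts is illustrative, not from the source.
retry_leave() {
    i=0
    while ! "$@"; do
        i=$((i + 1))
        if [ "$i" -ge 10 ]; then
            echo "cpg_leave error retrying: giving up after $i attempts"
            return 1
        fi
        sleep 1   # the sleep between retries is the nanosleep seen in gdb
    done
    echo "left group after $i retries"
}
```

If the leave never succeeds (e.g. because aisexec is no longer servicing the IPC), a loop like this would explain both the repeated "cpg_leave error retrying" message and why restarting openais lets the stop complete.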
Thanks,
Robert

Example cluster.conf:

<?xml version="1.0"?>
<cluster config_version="33" name="cluster_app_1">
  <logging to_syslog="yes" syslog_facility="local4" timestamp="on" debug="on">
    <logger ident="CPG" debug="on"/>
    <logger ident="CMAN" debug="on"/>
  </logging>
  <cman expected_nodes="2" expected_votes="3" quorum_dev_poll="70000">
    <multicast addr="239.192.1.192"/>
  </cman>
  <totem token="70000"/>
  <fence_daemon clean_start="0" log_facility="local4" post_fail_delay="10" post_join_delay="60"/>
  <quorumd interval="1" label="rhcs_qdisk" log_facility="local4" log_level="7" min_score="1" tko="60" votes="1">
    <heuristic interval="2" program="/bin/ping -c1 -t2 -Ibond0 10.162.106.1" score="1" tko="3"/>
  </quorumd>
  <clusternodes>
    <clusternode name="node1-priv" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="iLO_node1"/>
        </method>
      </fence>
      <multicast addr="239.192.1.192" interface="bond1"/>
    </clusternode>
    <clusternode name="node2-priv" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="iLO_node2"/>
        </method>
      </fence>
      <multicast addr="239.192.1.192" interface="bond1"/>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice action="off" agent="fence_ipmilan" ipaddr="X.X.X.X" login="node1_fence" name="iLO_node1" passwd="password" power_wait="10" lanplus="1"/>
    <fencedevice action="off" agent="fence_ipmilan" ipaddr="X.X.X.X" login="node2_fence" name="iLO_node2" passwd="password" power_wait="10" lanplus="1"/>
  </fencedevices>
  <rm log_level="7"/>
</cluster>

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster