> 3-node cluster. rgmanager is running on all three nodes, but service
> won't relocate over to node 3. clustat doesn't see rgmanager on it.
> Run from nodes 1 and 2, clustat shows all three nodes Online but only
> nodes 1 and 2 have rgmanager. Run from node 3, clustat shows all
> three Online and no rgmanager. This is what I'd see if rgmanager were
> not running on node3 at all. And yet: [...]

After I sent that email - and about an hour after the problem first
began - node2 spontaneously switched to showing rgmanager="0" in its
clustat -x output, even though node2 was where the service was running.

After rebooting node3 another time, its clurgmgrd was no longer in the
SIGCHLD loop I showed before. Instead, it was blocked on write(7, ...
According to lsof, file descriptor 7 was /dev/misc/dlm-control.

On #linux-cluster IRC, lon asked what group_tool ls showed...

node1 $ sudo group_tool ls
type             level name       id       state
fence            0     default    00010001 JOIN_STOP_WAIT [1 2 3 3]
dlm              1     rgmanager  00030001 JOIN_ALL_STOPPED [1 2 3]

node2 $ sudo group_tool ls
[sudo] password for oinbar:
type             level name       id       state
fence            0     default    00010001 JOIN_STOP_WAIT [1 2 3 3]
dlm              1     rgmanager  00030001 JOIN_ALL_STOPPED [1 2 3]

node3 $ sudo group_tool ls
[sudo] password for oinbar:
type             level name       id       state
fence            0     default    00000000 JOIN_STOP_WAIT [1 2 3]
dlm              1     rgmanager  00000000 JOIN_STOP_WAIT [1 2 3]

He also asked me to send SIGUSR1 to clurgmgrd and get the contents of
/tmp/rgmanager-dump*, but clurgmgrd did not respond to SIGUSR1 and I
got no dump files.
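(For reference, the dump attempt amounted to something like the
following - take the pidof lookup as an illustrative sketch rather
than a verbatim transcript of what I typed:

node3 $ sudo kill -USR1 `pidof clurgmgrd`   # ask clurgmgrd to dump its state
node3 $ ls /tmp/rgmanager-dump*             # where the dump files should appear

Nothing ever showed up under /tmp.)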
Also, I updated cluster.conf to change <rm log_level="6"> to
log_level="7". I started seeing this in /var/log/messages on node3:

Aug 18 10:00:07 node3 rgmanager: [8121]: <notice> Shutting down Cluster Service Manager...
Aug 18 10:13:31 node3 kernel: dlm: Using TCP for communications
Aug 18 10:13:31 node3 dlm_controld[1857]: process_uevent online@ error -17 errno 2
Aug 18 10:14:05 node3 kernel: dlm: rgmanager: group join failed -512 0
Aug 18 10:14:05 node3 kernel: dlm: Using TCP for communications
Aug 18 10:14:05 node3 dlm_controld[1857]: process_uevent online@ error -17 errno 2
Aug 18 10:14:33 node3 kernel: dlm: rgmanager: group join failed -512 0
Aug 18 10:14:36 node3 dlm_controld[1857]: process_uevent online@ error -17 errno 2
Aug 18 10:14:36 node3 kernel: dlm: Using TCP for communications
Aug 18 10:26:15 node3 rgmanager: [22290]: <notice> Shutting down Cluster Service Manager...
Aug 18 10:34:48 node3 kernel: dlm: rgmanager: group join failed -512 0

... and this in /var/log/messages on node1:

Aug 18 10:37:48 node1 kernel: INFO: task clurgmgrd:32606 blocked for more than 120 seconds.
Aug 18 10:37:48 node1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 18 10:37:48 node1 kernel: clurgmgrd     D ffff81016ae9abc0     0 32606  32605   633 (NOTLB)
Aug 18 10:37:48 node1 kernel:  ffff810169641de8 0000000000000086 ffff810169641d28 ffff810169641d28
Aug 18 10:37:48 node1 kernel:  0000000000000246 0000000000000008 ffff81006efad820 ffff810168493080
Aug 18 10:37:48 node1 kernel:  0003f21de24fde7f 000000000000f650 ffff81006efada08 000000007eea8300
Aug 18 10:37:48 node1 kernel: Call Trace:
Aug 18 10:37:48 node1 kernel:  [<ffffffff8002cd2c>] mntput_no_expire+0x19/0x89
Aug 18 10:37:48 node1 kernel:  [<ffffffff8000ea75>] link_path_walk+0xa6/0xb2
Aug 18 10:37:48 node1 kernel:  [<ffffffff800656ac>] __down_read+0x7a/0x92
Aug 18 10:37:48 node1 kernel:  [<ffffffff88473380>] :dlm:dlm_clear_proc_locks+0x20/0x1d2
Aug 18 10:37:48 node1 kernel:  [<ffffffff8001adcf>] cp_new_stat+0xe5/0xfd
Aug 18 10:37:48 node1 kernel:  [<ffffffff8847b0a9>] :dlm:device_close+0x55/0x99
Aug 18 10:37:48 node1 kernel:  [<ffffffff80012ac5>] __fput+0xd3/0x1bd
Aug 18 10:37:48 node1 kernel:  [<ffffffff80023bd1>] filp_close+0x5c/0x64
Aug 18 10:37:48 node1 kernel:  [<ffffffff8001dff3>] sys_close+0x88/0xbd
Aug 18 10:37:48 node1 kernel:  [<ffffffff8005e116>] system_call+0x7e/0x83
Aug 18 10:37:48 node1 kernel:

Finally, I rebooted all three cluster nodes at the same time. After I
did that, everything came back up in a good state.

I'm sending this followup in the hopes that someone can use this data
to determine what the bug was. If you do, please reply. Thanks!

-- 
Cos

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster