Cluster blocked because "waiting for 1 more stopped message"

"Dirk H. Schulz" <dirk.schulz@xxxxxxxxxxxxx> · Sat, 23 Jan 2010 11:31:49 +0100

Hi folks,

while testing activation and deactivation of my clustered logical volumes,
I drove the cluster into a state where clvmd on one node was stuck, so
I decided to reboot that node.
The reboot did not complete because the kernel could not unmount one of
the file systems, so I had to power the node off. So far my own fault,
I thought.

On the other node, "group_tool dump" returned:
1264171959 0:default waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
1264171959 2:XenImages waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
1264171959 1:XenImages waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
1264171959 1:clvmd waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
1264171959 got client 13 dump
Even after I rebooted the problem node and restarted cman, clvmd and
rgmanager on it, the services on the working node stayed stuck, and the
dump kept showing the messages above.
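
For anyone who wants to check for the same condition, here is a minimal
Python sketch that picks the "waiting for N more stopped messages" lines
out of dump text like the output quoted above. The function name
find_waiting_groups and the field names are my own invention, not part
of any cluster tool. The pattern is derived only from the lines shown
here, so other versions may print something different. I treat the
number before the colon and the trailing number as opaque values, since
I do not know for certain what they mean.

```python
import re

# Pattern derived only from the lines quoted in this post, e.g.:
#   1264171959 1:clvmd waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
# "prefix" (before the colon) and "arg" (trailing number) are kept as
# opaque values; their meaning is not assumed here.
WAITING = re.compile(
    r"^(?P<ts>\d+)\s+(?P<prefix>\d+):(?P<group>\S+)\s+waiting for\s+"
    r"(?P<count>\d+)\s+more stopped messages? before\s+"
    r"(?P<event>\S+)\s+(?P<arg>\d+)\s*$"
)


def find_waiting_groups(dump_text):
    """Return one dict per 'waiting for N more stopped messages' line."""
    # Mail clients may wrap the long lines after the word "before";
    # rejoin such a line with the one that follows it.
    joined = re.sub(r"before\s*\n\s*", "before ", dump_text)
    results = []
    for line in joined.splitlines():
        match = WAITING.match(line.strip())
        if match is None:
            continue  # e.g. "1264171959 got client 13 dump"
        results.append({
            "timestamp": int(match.group("ts")),
            "prefix": int(match.group("prefix")),
            "group": match.group("group"),
            "count": int(match.group("count")),
            "event": match.group("event"),
            "arg": int(match.group("arg")),
        })
    return results
```

Passing the saved dump text to find_waiting_groups() gives one entry per
group that is still waiting. This only reports the condition; it does
nothing to resolve it.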

I did not find any way to get the cluster back into working condition
other than rebooting the working node as well. Even a "kill -9" on clvmd
did not work!

Is there any way to manually fake the awaited "stopped message" so that
the rest of the cluster can carry on? There MUST be, because otherwise
this would defeat the whole concept of a cluster: waiting for a dead
node's last "stopped" message before continuing does not make much
sense to me.

If anyone out there could help me understand why it is implemented that
way and point me to what to do in such a case, I would be very happy.

Dirk

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster