Hi folks,
while experimenting with activating and deactivating my cluster logical
volumes, I drove the cluster into a situation where clvmd on one node was
stuck, so I decided to reboot that node.
The reboot did not complete because the kernel could not unmount some file
system, so I had to power the node off. So far my fault, I thought.
On the other node, "group_tool dump" returned:
1264171959 0:default waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
1264171959 2:XenImages waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
1264171959 1:XenImages waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
1264171959 1:clvmd waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
1264171959 got client 13 dump
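In case it helps anyone reproduce this, here is a minimal Python sketch that picks the groups still waiting for "stopped" messages out of such a dump. The line pattern is only inferred from the output quoted above, not taken from any groupd documentation, and the function name waiting_groups and the field name "prefix" (the number before the colon) are my own choices. It also re-joins lines that a mail client has wrapped.

```python
import re

# Pattern inferred from the dump lines quoted above (an assumption,
# not a documented format): "<timestamp> <prefix>:<name> waiting for N more stopped messages ..."
WAITING_RE = re.compile(
    r"^(?P<ts>\d+)\s+(?P<prefix>\d+):(?P<name>\S+)\s+"
    r"waiting for\s+(?P<count>\d+)\s+more stopped messages"
)

# In the quoted output every log line starts with a 10-digit timestamp.
TIMESTAMP_RE = re.compile(r"^\d{10}\s")


def waiting_groups(dump_text):
    """Return (prefix, name, count) for each group still waiting for stopped messages."""
    joined = []
    for line in dump_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # A line without a leading timestamp is treated as the wrapped
        # tail of the previous line and glued back on.
        if joined and not TIMESTAMP_RE.match(line):
            joined[-1] += " " + line
        else:
            joined.append(line)

    result = []
    for line in joined:
        m = WAITING_RE.match(line)
        if m:
            result.append((int(m.group("prefix")), m.group("name"), int(m.group("count"))))
    return result
```

To use it, save the output of "group_tool dump" to a file and pass the file's contents to waiting_groups(); each returned tuple names one group that is still blocked.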
Even after rebooting the problem node and restarting cman, clvmd and
rgmanager, the services on the working node remained stuck, and the
above messages were still being shown.
I did not find any way to bring the cluster back into working condition
other than rebooting the working node as well. Even a "kill -9" on clvmd
did not work!
Is there any way to manually fake the awaited "stopped" message so that
the rest of the cluster can carry on? There MUST be, because otherwise this
would defeat the concept of a cluster as a whole: waiting for a dead
node's last "stopped" message before continuing does not make
much sense to me.
If anyone out there could help me understand why it is implemented that
way and point me at what to do in such a case, I would be very happy.
Dirk
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster