[Linux-cluster] hung cluster question

Daniel McNeil <daniel@xxxxxxxx> · Fri, 03 Dec 2004 16:58:12 -0800

Another thing I am a bit confused by.  After hitting the
rm hang described before, I expected that resetting one of the
nodes of the cluster would clear up the problem, since
recovery should clean up the DLM lock state.

So I reset cl031.  cl030 still had the gfs file system mounted
and cl032 was a member of the cluster, but did not have a gfs
file system mounted.

When I reset cl031, both other nodes printed:
CMAN: no HELLO from cl031a, removing from the cluster

Since I had configured manual fencing, I expected that I
would see a message on one of the nodes saying I needed
to ack the fencing, but I never saw any message.

After that, running cat /proc/cluster/services hung.

I then reset cl030 and cl032 got:
CMAN: no HELLO from cl030a, removing from the cluster
CMAN: quorum lost, blocking activity
SM: 00000001 process_recovery_barrier status=-104

Does the SM: message mean anything?
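
One possible reading, and this is purely an assumption: if the status
value is a negated Linux errno, then -104 would correspond to
ECONNRESET. Below is a minimal sketch of that lookup. The table is a
hand-copied subset of the Linux errno numbers, and decode_status is a
made-up helper name, not anything from CMAN or SM.

```python
# Assumption: the SM "status" field is a negated Linux errno.
# The values below are a small subset of the Linux errno numbering.
LINUX_NET_ERRNOS = {
    103: "ECONNABORTED",
    104: "ECONNRESET",
    107: "ENOTCONN",
    110: "ETIMEDOUT",
    111: "ECONNREFUSED",
}


def decode_status(status):
    """Return the errno name for a negative status, or None if unknown."""
    if status >= 0:
        return None
    return LINUX_NET_ERRNOS.get(-status)


if __name__ == "__main__":
    print(decode_status(-104))
```

If that reading is right, the barrier would have been interrupted because
the connection to a peer went away, which would fit the timing of the
reset. It says nothing about whether the message is harmless.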

After rebooting the other 2 nodes, they rejoined the cluster
OK, but there were these messages in /var/log/messages:

Dec  3 16:53:44 cl032 fenced[17168]: fencing node "cl030a"
Dec  3 16:53:44 cl032 fenced[17168]: fence "cl030a" failed

So I'm not sure my manual fencing is working correctly.
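
To check whether this happens on every fence attempt, something like the
following could pull the attempts and failures out of the log. It is only
a sketch: it assumes the fenced messages always have the two forms quoted
above, and fence_events is a made-up helper name.

```python
import re

# Patterns based only on the two fenced lines quoted above; other
# message forms (if any exist) are not handled.
FENCE_START = re.compile(r'fenced\[\d+\]: fencing node "([^"]+)"')
FENCE_FAIL = re.compile(r'fenced\[\d+\]: fence "([^"]+)" failed')


def fence_events(lines):
    """Return (attempted, failed) lists of node names, in log order."""
    attempted = []
    failed = []
    for line in lines:
        match = FENCE_START.search(line)
        if match:
            attempted.append(match.group(1))
            continue
        match = FENCE_FAIL.search(line)
        if match:
            failed.append(match.group(1))
    return attempted, failed


if __name__ == "__main__":
    sample = [
        'Dec  3 16:53:44 cl032 fenced[17168]: fencing node "cl030a"',
        'Dec  3 16:53:44 cl032 fenced[17168]: fence "cl030a" failed',
    ]
    attempted, failed = fence_events(sample)
    print("attempted:", attempted)
    print("failed:", failed)
```

In practice the lines would come from reading /var/log/messages rather
than the hard-coded sample.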

Any suggestions?

Thanks,

Daniel