Re: [Linux-cluster] cluster lost quorum after 11 hours

Patrick Caulfield <pcaulfie@xxxxxxxxxx> · Mon, 14 Feb 2005 15:12:32 +0000

On Fri, Feb 11, 2005 at 04:47:38PM -0800, Daniel McNeil wrote:
> I was running my test on a 3 node cluster and it died
> after 11 hours.  cl030 lost quorum with the other 2 nodes
> kicked out of the cluster.  cl031 also hit a bunch of asserts
> like
>     lock_dlm:  Assertion failed on line 352 of file  
>     /Views/redhat-cluster/cluster/gfs-kernel/src/dlm/lock.c
>     lock_dlm:  assertion:  "!error"
>     lock_dlm:  time = 291694516
>     stripefs: error=-22 num=2,19
> which I assume is caused by the cluster shutting down.
> 
> 
> /var/log/messages showed:
> 
> cl030:
> Feb 11 02:44:33 cl030 kernel: CMAN: removing node cl032a from the cluster : No response to messages
> Feb 11 02:44:33 cl030 kernel: CMAN: removing node cl031a from the cluster : No response to messages
> Feb 11 02:44:33 cl030 kernel: CMAN: quorum lost, blocking activity
> Feb 11 14:40:33 cl030 sshd(pam_unix)[27323]: session opened for user root by (uid=0)

You should only get nodes dying from "No response to messages" during a state
transition of some sort (e.g. a node leaving or joining, or possibly a GFS
mount/unmount), in which case the DLM has to do recovery. I recently checked in
a couple of changes that stop DLM recovery from taking over the machine when
there are several thousand locks to recover; that might help.

During a normal "steady" state, a node should not die from 
"No response to messages" because the only messages that are being sent are
HELLO heartbeat messages and they are not acked.
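
To see whether the removals line up with a state transition, it can help to
pull the CMAN removal events out of /var/log/messages on each node and compare
the timestamps against joins, leaves, and mounts. What follows is only a rough
sketch: the regular expression is derived from the log lines quoted above, and
the names REMOVAL_RE and find_removals are made up for illustration. Other CMAN
versions may format these lines differently.

```python
import re

# Pattern inferred from the syslog lines quoted in this thread (an assumption;
# it is not taken from the CMAN source and may not match other versions).
REMOVAL_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"CMAN: removing node (?P<node>\S+) from the cluster : (?P<reason>.+)$"
)


def find_removals(lines):
    """Return (timestamp, reporting host, removed node, reason) tuples."""
    events = []
    for line in lines:
        m = REMOVAL_RE.match(line.strip())
        if m:
            events.append((m["stamp"], m["host"], m["node"], m["reason"]))
    return events


# Sample input copied from the cl030 log excerpt above.
sample = [
    "Feb 11 02:44:33 cl030 kernel: CMAN: removing node cl032a from the cluster : No response to messages",
    "Feb 11 02:44:33 cl030 kernel: CMAN: removing node cl031a from the cluster : No response to messages",
    "Feb 11 02:44:33 cl030 kernel: CMAN: quorum lost, blocking activity",
]

for event in find_removals(sample):
    print(event)
```

Running the same function over the logs from all three nodes and sorting by
timestamp should show what else was happening just before 02:44:33.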
-- 

patrick