Hi,
We have a setup with two HP DL360 nodes connected to an MSA500 disk array via SCSI cables. We are running RHEL4 Update 3, and our product has an active-passive design; the active-passive failover is managed internally by the product.
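Our cluster.conf is, roughly, the standard two-node layout; the sketch below uses anonymized placeholder node names, addresses, and credentials rather than our real values:

  <?xml version="1.0"?>
  <cluster name="prodcluster" config_version="1">
    <!-- two_node="1" lets a two-node cluster stay quorate on one vote -->
    <cman two_node="1" expected_votes="1"/>
    <clusternodes>
      <clusternode name="n1" votes="1">
        <fence>
          <method name="1">
            <device name="ilo-n1"/>
          </method>
        </fence>
      </clusternode>
      <clusternode name="n2" votes="1">
        <fence>
          <method name="1">
            <device name="ilo-n2"/>
          </method>
        </fence>
      </clusternode>
    </clusternodes>
    <fencedevices>
      <!-- fence_ilo drives the DL360 iLO ports; addresses/logins anonymized -->
      <fencedevice name="ilo-n1" agent="fence_ilo" hostname="10.0.0.1" login="fence" passwd="secret"/>
      <fencedevice name="ilo-n2" agent="fence_ilo" hostname="10.0.0.2" login="fence" passwd="secret"/>
    </fencedevices>
  </cluster>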
Every now and then, one of the nodes outputs the kernel messages below, after which the other node fences it. This causes a failover of our product.
Jan 19 13:38:58 n1 kernel: FS1 move flags 0,1,0 ids 0,2,0
Jan 19 13:38:58 n1 kernel: FS1 move use event 2
Jan 19 13:38:58 n1 kernel: FS1 recover event 2 (first)
Jan 19 13:38:58 n1 kernel: FS1 add nodes
Jan 19 13:38:58 n1 kernel: FS1 total nodes 1
Jan 19 13:38:58 n1 kernel: FS1 rebuild resource directory
Jan 19 13:38:58 n1 kernel: FS1 rebuilt 0 resources
Jan 19 13:38:58 n1 kernel: FS1 recover event 2 done
Jan 19 13:38:58 n1 kernel: FS1 move flags 0,0,1 ids 0,2,2
Jan 19 13:38:58 n1 kernel: FS1 process held requests
Jan 19 13:38:58 n1 kernel: FS1 processed 0 requests
Jan 19 13:38:58 n1 kernel: FS1 recover event 2 finished
Jan 19 13:38:58 n1 kernel: FS1 move flags 1,0,0 ids 2,2,2
Jan 19 13:38:58 n1 kernel: FS1 move flags 0,1,0 ids 2,5,2
Jan 19 13:38:58 n1 kernel: FS1 move use event 5
Jan 19 13:38:58 n1 kernel: FS1 recover event 5
Jan 19 13:38:58 n1 kernel: FS1 add node 2
Jan 19 13:38:58 n1 kernel: FS1 total nodes 2
Jan 19 13:38:58 n1 kernel: FS1 rebuild resource directory
Jan 19 13:38:58 n1 kernel: FS1 rebuilt 7409 resources
Jan 19 13:38:58 n1 kernel: FS1 purge requests
Jan 19 13:38:58 n1 kernel: FS1 purged 0 requests
Jan 19 13:38:58 n1 kernel: FS1 mark waiting requests
Jan 19 13:38:58 n1 kernel: FS1 marked 0 requests
Jan 19 13:38:58 n1 kernel: FS1 recover event 5 done
Jan 19 13:38:58 n1 kernel: FS1 move flags 0,0,1 ids 2,5,5
Jan 19 13:38:58 n1 kernel: FS1 process held requests
Jan 19 13:38:58 n1 kernel: FS1 processed 0 requests
Jan 19 13:38:58 n1 kernel: FS1 resend marked requests
Jan 19 13:38:58 n1 kernel: FS1 resent 0 requests
Jan 19 13:38:58 n1 kernel: FS1 recover event 5 finished
Jan 19 13:38:58 n1 kernel: FS1 send einval to 2
Jan 19 13:38:58 n1 kernel: FS1 send einval to 2
Jan 19 13:38:58 n1 kernel: FS1 unlock ff9b0297 no id
Jan 19 13:38:59 n1 kernel: -2
Jan 19 13:38:59 n1 kernel: 2712 en punlock 7,3019aa2
Jan 19 13:38:59 n1 kernel: 2712 ex punlock -2
Jan 19 13:38:59 n1 kernel: 2712 en punlock 7,3019aa2
Jan 19 13:38:59 n1 kernel: 2712 ex punlock -2
Jan 19 13:38:59 n1 kernel: 2712 en punlock 7,3019aa2
Jan 19 13:38:59 n1 kernel: 2712 ex punlock -2
Jan 19 13:38:59 n1 kernel: 2712 en punlock 7,3019aa2
Then the other node says "missed too many heartbeats" and fences it out. It does some minor recovery work and everything is fine afterwards.
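As a stopgap we thought about raising the cman heartbeat timeouts so that a briefly stalled node isn't declared dead quite so quickly. If I understand the RHEL4 cman /proc interface correctly (please correct me if these tunables are wrong), it would be something like:

  # check the current values on each node (both are in seconds)
  cat /proc/cluster/config/cman/hello_timer
  cat /proc/cluster/config/cman/deadnode_timeout

  # allow more missed heartbeats before a node is declared dead,
  # e.g. 60 seconds; this would need to be set on every node
  echo 60 > /proc/cluster/config/cman/deadnode_timeout

But that would only paper over whatever is stalling the node, so we'd rather find the root cause.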
Is this a bug? The two nodes don't seem to be doing much at the time this happens. We have seen the same thing on another, similar setup (two DL360s and an MSA500), and it happens quite regularly.
I remember seeing a mention of something similar on a mailing list, to which Patrick Caulfield answered:
> If you're running the cman from RHEL4 Update 3 then there's a bug in there
> you might be hitting: http://www.spinics.net/lists/cluster/msg07016.html
> You'll need to upgrade all the nodes in the cluster to get rid of it. I
> can't tell for sure if it is that problem you're having without seeing more
> kernel messages though.
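Before upgrading we'd want to confirm exactly what we're running on each node, with something like the following (assuming the standard RHEL4 cluster suite package names):

  # list the installed cluster packages so we can compare
  # against the fixed versions
  rpm -q ccs cman cman-kernel dlm dlm-kernel fence magma magma-plugins

That should tell us whether we're on the affected cman.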
Any ideas?
Thanks.
--
Coman ILIUT
Mitel Networks
Ottawa, ON