On Wed, Jul 28, 2010 at 9:44 PM, Köppel Benedikt (LET)
<benedikt.koeppel@xxxxxxxxxxx> wrote:
> OK, with the help of Andrew, I tried it again.
>
> Some important logs from the problem:
>
> ~snip~
>
> 1317 Jul 28 00:46:31 pcmknode-1 corosync[2618]: [TOTEM ] A processor failed, forming new configuration.
> 1318 Jul 28 00:46:32 pcmknode-1 kernel: dlm: closing connection to node -1147763583
> 1319 Jul 28 00:46:32 pcmknode-1 corosync[2618]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 624: memb=1, new=0, lost=1
> 1320 Jul 28 00:46:32 pcmknode-1 corosync[2618]: [pcmk ] info: pcmk_peer_update: memb: pcmknode-1 3130426497
> 1321 Jul 28 00:46:32 pcmknode-1 corosync[2618]: [pcmk ] info: pcmk_peer_update: lost: pcmknode-2 3147203713
>
> ~snip~
>
> 1338 Jul 28 00:46:32 pcmknode-1 crmd: [2629]: info: crm_update_peer: Node pcmknode-2: id=3147203713 state=lost (new) addr=r(0) ip(192.168.150.187) votes=1 born=620 seen=620 proc=00000000000000000000000000111312
> 1339 Jul 28 00:46:32 pcmknode-1 crmd: [2629]: info: erase_node_from_join: Removed node pcmknode-2 from join calculations: welcomed=0 itegrated=0 finalized=0 confirmed=1
> 1340 Jul 28 00:46:32 pcmknode-1 crmd: [2629]: info: crm_update_quorum: Updating quorum status to false (call=45)
>
> ~snip~
>
> 1351 Jul 28 00:46:32 pcmknode-1 pengine: [2628]: WARN: pe_fence_node: Node pcmknode-2 will be fenced because it is un-expectedly down
> 1352 Jul 28 00:46:32 pcmknode-1 pengine: [2628]: info: determine_online_status_fencing: ha_state=active, ccm_state=false, crm_state=online, join_state=member, expected=member
> 1353 Jul 28 00:46:32 pcmknode-1 pengine: [2628]: WARN: determine_online_status: Node pcmknode-2 is unclean
>
> ~snip~
>
> I then removed the LVM from /dev/sdb2 and created the GFS2 directly on
> /dev/sdb2 (without LVM). That did not solve the problem.
>
> Starting corosync on only one node works fine, and the GFS2 disk can even
> be mounted. But as soon as the GFS2 disk is mounted on the second node,
> that node gets fenced immediately. I set WebFSClone's target-role to
> Stopped, and as soon as I manually started it again, the node got fenced.
> Manually mounting the GFS2 disk (with mount -t gfs2...) on the second node
> also triggers the STONITH.
>
> One word about my STONITH: it is SBD, running via /dev/sdb1. I got the
> cluster-glue SRPM from Clusterlabs and extracted it to compile SBD
> manually (only SBD, nothing else), then installed it. So my system runs
> packages from these repositories: RHEL 6 beta, EPEL, and Clusterlabs.
>
> I monitored the network traffic with tcpdump and analyzed it afterwards
> with Wireshark. The two DLM instances are communicating, but I don't know
> whether something goes wrong there. I see a packet going from pcmknode-2
> to pcmknode-1 with this content as decoded by Wireshark (some lines that
> I think are not interesting are omitted; I can provide them if needed):
>
> Command: message (1)
> Message Type: lookup message (11)
> External Flags: 0x08, Return the contents of the lock value block
> Status: Unknown (0)
> Granted Mode: invalid (-1)
> Request Mode: exclusive (5)
>
> And then the response from pcmknode-1 to pcmknode-2:
>
> Command: message (1)
> Message Type: request reply (5)
> External Flags: 0x08, Return the contents of the lock value block
> Status: granted (2)
> Granted Mode: exclusive (5)
> Request Mode: invalid (-1)
>
> I wonder why pcmknode-1 says "Granted: exclusive" to pcmknode-2.

No idea, I don't have much to do with the DLM.
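If you want to dig into that exchange further, it may help to capture only
the DLM traffic and to compare the lock state each node reports right
before the mount. Roughly something like the following should do it - the
interface, capture file and lockspace name are just placeholders, and this
assumes the dlm_tool utility from the cluster packages is installed:

  # capture only DLM traffic (21064 is the DLM's default TCP port;
  # eth0 and the output file are only examples)
  tcpdump -i eth0 -s 0 -w /tmp/dlm.pcap tcp port 21064

  # list the lockspaces and dump dlm_controld's debug buffer on each node
  dlm_tool ls
  dlm_tool dump

  # dump the lock state of a given lockspace (name taken from the
  # "dlm_tool ls" output; needs debugfs mounted under /sys/kernel/debug)
  dlm_tool lockdebug <lockspace>

Comparing that output from both nodes, taken just before the second mount,
should at least narrow down where things go wrong.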
> Immediately after the request reply, pcmknode-2 writes "Now mounting
> FS..." to the log and gets fenced and shut down.

As I explained on IRC yesterday, the node getting fenced is not the issue
here. For some reason mounting the GFS volume is causing the node to fail -
this is the root cause. Almost all the gfs and dlm code is shared between
cman and pacemaker - so it's quite possible that the dlm has an issue.
Perhaps file a bug against the dlm.

>
> So, is there perhaps something wrong with the DLM?
>
> Regards,
> Benedikt

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster