Re: gfs2_jadd borked my cluster?

FYI, RHN support has provided no insight into the problem.  We recreated the GFS2 filesystems and can join/use them, but after a few days, they all withdraw at some point.  :(

I suspect a virtio_blk caching issue is causing the problems with GFS2 on KVM guests.  The RHEL 5.6 (beta) release notes mention that "a caching issue" (worded just that vaguely) was corrected in the virtio_blk module.  And the RHEL 6 documentation declares GFS2 a supported filesystem on KVM guests -- there is no such written statement anywhere in the RHEL 5 documentation.
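
For anyone who wants to compare notes, this is roughly how we have been checking the cache mode on the guests' virtio disks (guest name watsonapp2 is from our cluster; substitute your own):

    # Inspect the cache mode libvirt has set on the guest's virtio disk.
    virsh dumpxml watsonapp2 | grep "<driver"

    # For shared storage under a clustered filesystem, the host page cache
    # should be bypassed entirely, i.e. the disk's driver element should read:
    #
    #   <driver name='qemu' type='raw' cache='none'/>
    #
    # The definition can be changed with `virsh edit watsonapp2`; the guest
    # must be restarted for a new cache mode to take effect.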

However, RHN support wrote back in my ticket that our infrastructure and cluster configuration are supported.  It just doesn't work.  :P

I am going to try the GNBD method for the KVM guests instead.  Interestingly, the GNBD documentation specifically calls out a caching feature that must be _disabled_ or it can lead to corruption -- very similar to what we are experiencing using virtio_blk against a Fibre Channel disk.
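
For reference, this is the kind of export we plan to try -- a minimal sketch (the device path and hostname are placeholders from our setup), with the caching flag deliberately left out per the GNBD documentation's warning:

    # On the GNBD server node: export the shared LUN *uncached*.
    gnbd_export -d /dev/mapper/homedt55 -e homedt55
    # NOTE: do not pass -c (enable caching); the GNBD docs warn that a
    # cached export under GFS/GFS2 can lead to filesystem corruption.

    # On each KVM guest: import all exports from the server.
    gnbd_import -i gnbd-server.example.com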

Is anyone else running GFS2 clustered filesystems on KVM guests, with or without physical hosts in the mix?  We'd like to know, thanks.


-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Hurst,Robert (BIDMC - Information Systems)
Sent: Wednesday, October 20, 2010 12:51 PM
To: linux-cluster@xxxxxxxxxx
Subject: Re:  gfs2_jadd borked my cluster?

Also, the messages from the failure follow:

Oct 20 12:11:28 watsonapp2 ccsd[3016]: Initial status:: Quorate 
Oct 20 12:11:40 watsonapp2 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "ccc_devtest55:homedt55"
Oct 20 12:11:40 watsonapp2 kernel: dlm: Using TCP for communications
Oct 20 12:11:40 watsonapp2 kernel: dlm: connecting to 8
Oct 20 12:11:40 watsonapp2 kernel: dlm: connecting to 3
Oct 20 12:11:40 watsonapp2 kernel: dlm: connecting to 2
Oct 20 12:11:40 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.3: Joined cluster. Now mounting FS...
Oct 20 12:11:40 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.3: can't mount journal #3
Oct 20 12:11:40 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.3: there are only 3 journals (0 - 2)
Oct 20 12:17:03 watsonapp2 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "ccc_devtest55:homedt55"
Oct 20 12:17:03 watsonapp2 kernel: dlm: Using TCP for communications
Oct 20 12:17:03 watsonapp2 kernel: dlm: connecting to 3
Oct 20 12:17:03 watsonapp2 kernel: dlm: connecting to 2
Oct 20 12:17:03 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.0: Joined cluster. Now mounting FS...
Oct 20 12:17:04 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.0: jid=0, already locked for use
Oct 20 12:17:04 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.0: jid=0: Looking at journal...
Oct 20 12:17:04 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.0: jid=0: Done
Oct 20 12:17:17 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.0: fatal: filesystem consistency error
Oct 20 12:17:17 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.0:   RG = 458777
Oct 20 12:17:17 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.0:   function = gfs2_setbit, file = fs/gfs2/rgrp.c, line = 97
Oct 20 12:17:17 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.0: about to withdraw this file system
Oct 20 12:17:17 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.0: telling LM to withdraw
Oct 20 12:17:18 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.0: withdrawn
Oct 20 12:17:18 watsonapp2 kernel: 
Oct 20 12:17:18 watsonapp2 kernel: Call Trace:
Oct 20 12:17:18 watsonapp2 kernel:  [<ffffffff884b543e>] :gfs2:gfs2_lm_withdraw+0xd1/0xfe
Oct 20 12:17:18 watsonapp2 kernel:  [<ffffffff80013b19>] find_lock_page+0x26/0xa2
Oct 20 12:17:18 watsonapp2 kernel:  [<ffffffff80025c06>] find_or_create_page+0x22/0x72
Oct 20 12:17:18 watsonapp2 kernel:  [<ffffffff884b72d2>] :gfs2:__glock_lo_add+0x62/0x89
Oct 20 12:17:18 watsonapp2 kernel:  [<ffffffff884c8ae3>] :gfs2:gfs2_consist_rgrpd_i+0x34/0x39
Oct 20 12:17:18 watsonapp2 kernel:  [<ffffffff884c555f>] :gfs2:rgblk_free+0x13a/0x15c
Oct 20 12:17:18 watsonapp2 kernel:  [<ffffffff884c5801>] :gfs2:gfs2_unlink_di+0x25/0x60
Oct 20 12:17:18 watsonapp2 kernel:  [<ffffffff884b3be9>] :gfs2:gfs2_change_nlink+0xf8/0x102
Oct 20 12:17:18 watsonapp2 kernel:  [<ffffffff884bfa8b>] :gfs2:gfs2_rename+0x470/0x652
Oct 20 12:17:18 watsonapp2 kernel:  [<ffffffff884bf71b>] :gfs2:gfs2_rename+0x100/0x652
Oct 20 12:17:18 watsonapp2 kernel:  [<ffffffff884bf73c>] :gfs2:gfs2_rename+0x121/0x652
Oct 20 12:17:18 watsonapp2 kernel:  [<ffffffff884bf761>] :gfs2:gfs2_rename+0x146/0x652
Oct 20 12:17:18 watsonapp2 kernel:  [<ffffffff884bf786>] :gfs2:gfs2_rename+0x16b/0x652
Oct 20 12:17:18 watsonapp2 kernel:  [<ffffffff884bf7b9>] :gfs2:gfs2_rename+0x19e/0x652
Oct 20 12:17:18 watsonapp2 kernel:  [<ffffffff80030c69>] d_splice_alias+0xdc/0xfb
Oct 20 12:17:18 watsonapp2 kernel:  [<ffffffff8000d9d8>] permission+0x81/0xc8
Oct 20 12:17:18 watsonapp2 kernel:  [<ffffffff8002a9ec>] vfs_rename+0x2f4/0x471
Oct 20 12:17:19 watsonapp2 kernel:  [<ffffffff80036be0>] sys_renameat+0x180/0x1eb
Oct 20 12:17:19 watsonapp2 kernel:  [<ffffffff80066b88>] do_page_fault+0x4fe/0x874
Oct 20 12:17:19 watsonapp2 kernel:  [<ffffffff800b7649>] audit_syscall_entry+0x180/0x1b3
Oct 20 12:17:19 watsonapp2 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Oct 20 12:17:19 watsonapp2 kernel: 

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Hurst,Robert (BIDMC - Information Systems)
Sent: Wednesday, October 20, 2010 12:41 PM
To: linux-cluster@xxxxxxxxxx
Subject:  gfs2_jadd borked my cluster?

Latest RHEL 5u5 with a four-node cluster:

cman-2.0.115-34.el5_5.3
gfs2-utils-0.1.62-20.el5
kernel-2.6.18-194.17.1.el5

Three nodes are blades; the fourth is a KVM guest.

I executed `gfs2_jadd -j1 /home` to add a fourth journal; it completed successfully with an old=3, new=4 message.  I checked on all three nodes with `gfs2_tool journals /home` and they all reported four journals of size 128MB, as shown below.
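
(For completeness, the exact sequence, with /home being our GFS2 mount point:)

    # Add one journal to the mounted GFS2 filesystem (run on one node).
    gfs2_jadd -j1 /home

    # Verify the journal count on every node that has /home mounted.
    gfs2_tool journals /home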

I joined the KVM guest to the cluster.  I attempted to mount /home and it complained there were only three journals.  EH???  So, I unmounted /home on a blade and mounted it on the KVM guest -- this time it mounted.

Checking the journals on all hosts again, each now reports only three.

I unmounted /home on the KVM guest and re-mounted it on the blade.  It, too, now reports only three journals.

I repeated the process, but the second time around I got a GFS2 filesystem withdrawal dump on the guest.  Now the DLM has that channel locked on all nodes with a LEAVE_STOP_WAIT status.  I tried fence_node against the guest; it rebooted the node fine, but now the DLM fence group is stuck with a FAIL_ALL_STOPPED status.
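
(Those statuses came from group_tool; a sketch of how we checked them on each node:)

    # Show fence/dlm/gfs group membership and state on this node; the
    # LEAVE_STOP_WAIT / FAIL_ALL_STOPPED strings appear in the state column.
    group_tool ls

    # The same view through the cluster manager:
    cman_tool services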

1) Can I clear this issue (obviously without rebooting)?

2) What could possibly have gone wrong with gfs2_jadd?

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
