FYI, RHN support has provided no insight into the problem. We recreated the GFS2 filesystems and can join/use them, but after a few days they all withdraw at some point. :(

I suspect a virtio_blk caching issue is causing the problems with GFS2 on KVM guests. I read in the RHEL 5.6 (beta) release notes that "a caching issue" (written just that generically) was corrected in the virtio_blk module. And RHEL 6 declares that GFS2 is a supported filesystem on KVM guests -- there is no such written statement anywhere in the RHEL 5 documentation. However, RHN support wrote back in my ticket that our infrastructure and cluster configuration are supported. It just doesn't work. :P

I am going to try the GNBD method for the KVM guests. Interestingly, its documentation specifically calls out a caching feature to _disable_, or it can lead to corruption -- something very much along the lines of what we are experiencing using virtio_blk to a fibre-channel disk.

Is anyone else running KVM guests, with or without a mix of physical hosts, using GFS2 clustered filesystems? We'd like to know, thanks.

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Hurst,Robert (BIDMC - Information Systems)
Sent: Wednesday, October 20, 2010 12:51 PM
To: linux-cluster@xxxxxxxxxx
Subject: Re: gfs2_jadd borked my cluster?

Also, the messages from the failure follow:

Oct 20 12:11:28 watsonapp2 ccsd[3016]: Initial status:: Quorate
Oct 20 12:11:40 watsonapp2 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "ccc_devtest55:homedt55"
Oct 20 12:11:40 watsonapp2 kernel: dlm: Using TCP for communications
Oct 20 12:11:40 watsonapp2 kernel: dlm: connecting to 8
Oct 20 12:11:40 watsonapp2 kernel: dlm: connecting to 3
Oct 20 12:11:40 watsonapp2 kernel: dlm: connecting to 2
Oct 20 12:11:40 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.3: Joined cluster. Now mounting FS...
Oct 20 12:11:40 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.3: can't mount journal #3
Oct 20 12:11:40 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.3: there are only 3 journals (0 - 2)
Oct 20 12:17:03 watsonapp2 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "ccc_devtest55:homedt55"
Oct 20 12:17:03 watsonapp2 kernel: dlm: Using TCP for communications
Oct 20 12:17:03 watsonapp2 kernel: dlm: connecting to 3
Oct 20 12:17:03 watsonapp2 kernel: dlm: connecting to 2
Oct 20 12:17:03 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.0: Joined cluster. Now mounting FS...
Oct 20 12:17:04 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.0: jid=0, already locked for use
Oct 20 12:17:04 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.0: jid=0: Looking at journal...
Oct 20 12:17:04 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.0: jid=0: Done
Oct 20 12:17:17 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.0: fatal: filesystem consistency error
Oct 20 12:17:17 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.0: RG = 458777
Oct 20 12:17:17 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.0: function = gfs2_setbit, file = fs/gfs2/rgrp.c, line = 97
Oct 20 12:17:17 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.0: about to withdraw this file system
Oct 20 12:17:17 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.0: telling LM to withdraw
Oct 20 12:17:18 watsonapp2 kernel: GFS2: fsid=ccc_devtest55:homedt55.0: withdrawn
Oct 20 12:17:18 watsonapp2 kernel:
Oct 20 12:17:18 watsonapp2 kernel: Call Trace:
Oct 20 12:17:18 watsonapp2 kernel: [<ffffffff884b543e>] :gfs2:gfs2_lm_withdraw+0xd1/0xfe
Oct 20 12:17:18 watsonapp2 kernel: [<ffffffff80013b19>] find_lock_page+0x26/0xa2
Oct 20 12:17:18 watsonapp2 kernel: [<ffffffff80025c06>] find_or_create_page+0x22/0x72
Oct 20 12:17:18 watsonapp2 kernel: [<ffffffff884b72d2>] :gfs2:__glock_lo_add+0x62/0x89
Oct 20 12:17:18 watsonapp2 kernel: [<ffffffff884c8ae3>] :gfs2:gfs2_consist_rgrpd_i+0x34/0x39
Oct 20 12:17:18 watsonapp2 kernel: [<ffffffff884c555f>] :gfs2:rgblk_free+0x13a/0x15c
Oct 20 12:17:18 watsonapp2 kernel: [<ffffffff884c5801>] :gfs2:gfs2_unlink_di+0x25/0x60
Oct 20 12:17:18 watsonapp2 kernel: [<ffffffff884b3be9>] :gfs2:gfs2_change_nlink+0xf8/0x102
Oct 20 12:17:18 watsonapp2 kernel: [<ffffffff884bfa8b>] :gfs2:gfs2_rename+0x470/0x652
Oct 20 12:17:18 watsonapp2 kernel: [<ffffffff884bf71b>] :gfs2:gfs2_rename+0x100/0x652
Oct 20 12:17:18 watsonapp2 kernel: [<ffffffff884bf73c>] :gfs2:gfs2_rename+0x121/0x652
Oct 20 12:17:18 watsonapp2 kernel: [<ffffffff884bf761>] :gfs2:gfs2_rename+0x146/0x652
Oct 20 12:17:18 watsonapp2 kernel: [<ffffffff884bf786>] :gfs2:gfs2_rename+0x16b/0x652
Oct 20 12:17:18 watsonapp2 kernel: [<ffffffff884bf7b9>] :gfs2:gfs2_rename+0x19e/0x652
Oct 20 12:17:18 watsonapp2 kernel: [<ffffffff80030c69>] d_splice_alias+0xdc/0xfb
Oct 20 12:17:18 watsonapp2 kernel: [<ffffffff8000d9d8>] permission+0x81/0xc8
Oct 20 12:17:18 watsonapp2 kernel: [<ffffffff8002a9ec>] vfs_rename+0x2f4/0x471
Oct 20 12:17:19 watsonapp2 kernel: [<ffffffff80036be0>] sys_renameat+0x180/0x1eb
Oct 20 12:17:19 watsonapp2 kernel: [<ffffffff80066b88>] do_page_fault+0x4fe/0x874
Oct 20 12:17:19 watsonapp2 kernel: [<ffffffff800b7649>] audit_syscall_entry+0x180/0x1b3
Oct 20 12:17:19 watsonapp2 kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Oct 20 12:17:19 watsonapp2 kernel:

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Hurst,Robert (BIDMC - Information Systems)
Sent: Wednesday, October 20, 2010 12:41 PM
To: linux-cluster@xxxxxxxxxx
Subject: gfs2_jadd borked my cluster?

Latest RHEL 5u5 with a four-node cluster:

cman-2.0.115-34.el5_5.3
gfs2-utils-0.1.62-20.el5
kernel-2.6.18-194.17.1.el5

Three nodes are blades; the fourth is a KVM guest. I executed `gfs2_jadd -j1 /home` to add a fourth journal; it completed successfully with an old=3, new=4 message.
I checked on all three nodes with `gfs2_tool journals /home` and they all reported four journals of size 128MB. I joined the KVM guest to the cluster. I attempted to mount /home and it complained there were only three journals. EH???

So, I umounted /home on a blade and mounted /home on the KVM guest -- it allowed it to mount. Checking journals on all hosts again, they now report only 3. I umounted /home on the KVM guest and re-mounted it on the blade. It, too, only reports 3 journals now.

I repeated the process again, but the second time around I got a GFS2 filesystem withdrawal dump on the guest. And now the DLM has that channel locked on all nodes with a LEAVE_STOP_WAIT status. I tried fence_node against the guest; it re-booted the node fine, but now the DLM fence is locked with a FAIL_ALL_STOPPED status.

1) Can I clear this issue (obviously without re-booting)?
2) What could possibly have gone wrong with gfs2_jadd?

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
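P.S. For anyone trying to follow along, the add-and-verify sequence was roughly this (a sketch from memory, not a copy-paste transcript; the device path in the mount step is a placeholder, not our actual volume name):

```
# On one node that already has the GFS2 filesystem mounted,
# add one journal (old=3, new=4 was reported on success):
gfs2_jadd -j1 /home

# On EVERY node, confirm the journal count and size before
# letting a new node join:
gfs2_tool journals /home

# Then attempt the mount on the new (KVM guest) node
# (placeholder device path -- substitute your own):
mount -t gfs2 /dev/mapper/vg_shared-homedt55 /home
```

These commands require a live, quorate cluster with the filesystem in place; run the journal check on all nodes, since (as above) different nodes can apparently disagree about the count.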