This recently happened to us and I'm wondering if there's anything else we can do to prevent it and/or more fully recover from it. Setup: 3 physical nodes, 64-bit, kernel 2.6.18-128.2.1.el5xen, 3 GFS2 filesystems mounted by each node. A zero-byte file in a subdirectory about 3 levels deep causes a total meltdown whenever it is touched in any way. Details below...

We took the filesystem offline (on all nodes) and ran gfs2_fsck against it. The FS is 6.2TB, living on a 2Gb/sec fibrechannel array. The check took 9 hours to complete, which was not as bad as I had feared. Afterwards the filesystem was remounted and we attempted to remove that zero-byte file, and the same thing happened again. So it appears gfs2_fsck did not fix it.

Since there are 3 GFS2 filesystems, the bad part was that access to all 3 went away when one of them had an issue, because GFS itself appears to have crashed. That's the part I don't understand, and I'm pretty sure it is not what should have happened. After a full reboot we renamed the directory holding the "bad" zero-byte file to a directory in the root of that GFS2 filesystem and are simply avoiding it at this point. Thanks...

Description from a co-worker on what he found from researching this:

While hitting something on the filesystem, it runs into an invalid metadata block, realizes the error, and attempts to take the FS offline because it's bad.
(To not risk additional corruption.)

Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: fatal: invalid metadata block
Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1:   bh = 1633350398 (magic number)
Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1:   function = gfs2_meta_indirect_buffer, file = /builddir/build/BUILD/gfs2-kmod-1.92/_kmod_build_xen/meta_io.c, line = 334
Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: about to withdraw this file system
Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: telling LM to withdraw
Jul 22 04:11:57 srvname kernel: GFS2: fsid=cluname:raid1.1: withdrawn

Unfortunately... for some reason, when it completes the withdrawal process, GFS crashes. I'm sure it's not supposed to do that. It should continue allowing access to all of the other GFS filesystems, but since the gfs2 module is dying, it kills access to every GFS filesystem.

Jul 22 04:11:57 srvname kernel: Call Trace:
Jul 22 04:11:57 srvname kernel:  [<ffffffff8854891a>] :gfs2:gfs2_lm_withdraw+0xc1/0xd0
Jul 22 04:11:57 srvname kernel:  [<ffffffff80262907>] __wait_on_bit+0x60/0x6e
Jul 22 04:11:57 srvname kernel:  [<ffffffff80215788>] sync_buffer+0x0/0x3f
Jul 22 04:11:57 srvname kernel:  [<ffffffff80262981>] out_of_line_wait_on_bit+0x6c/0x78
Jul 22 04:11:57 srvname kernel:  [<ffffffff8029a01a>] wake_bit_function+0x0/0x23
Jul 22 04:11:57 srvname kernel:  [<ffffffff8021a7f1>] submit_bh+0x10a/0x111
Jul 22 04:11:57 srvname kernel:  [<ffffffff8855a627>] :gfs2:gfs2_meta_check_ii+0x2c/0x38
Jul 22 04:11:57 srvname kernel:  [<ffffffff8854c168>] :gfs2:gfs2_meta_indirect_buffer+0x104/0x160
Jul 22 04:11:57 srvname kernel:  [<ffffffff8853c786>] :gfs2:recursive_scan+0x96/0x175
Jul 22 04:11:57 srvname kernel:  [<ffffffff8853c82c>] :gfs2:recursive_scan+0x13c/0x175
Jul 22 04:11:57 srvname kernel:  [<ffffffff8853d65a>] :gfs2:do_strip+0x0/0x358
Jul 22 04:11:57 srvname kernel:  [<ffffffff802639f9>] _spin_lock_irqsave+0x9/0x14
Jul 22 04:11:57 srvname kernel:  [<ffffffff8853c8fe>] :gfs2:trunc_dealloc+0x99/0xe7
Jul 22 04:11:57 srvname kernel:  [<ffffffff8853d65a>] :gfs2:do_strip+0x0/0x358
Jul 22 04:11:57 srvname kernel:  [<ffffffff88545149>] :gfs2:gfs2_glock_dq+0x1e/0x132
Jul 22 04:11:57 srvname kernel:  [<ffffffff8020b7bf>] kfree+0x15/0xc5
Jul 22 04:11:57 srvname kernel:  [<ffffffff8853df97>] :gfs2:gfs2_truncatei+0x5e5/0x70d
Jul 22 04:11:57 srvname kernel:  [<ffffffff88545149>] :gfs2:gfs2_glock_dq+0x1e/0x132
Jul 22 04:11:57 srvname kernel:  [<ffffffff88544b28>] :gfs2:gfs2_glock_put+0x1a/0xe2
Jul 22 04:11:57 srvname kernel:  [<ffffffff88550b83>] :gfs2:gfs2_setattr+0xe6/0x335
Jul 22 04:11:57 srvname kernel:  [<ffffffff88550acd>] :gfs2:gfs2_setattr+0x30/0x335
Jul 22 04:11:57 srvname kernel:  [<ffffffff8026349f>] __down_write_nested+0x35/0x9a
Jul 22 04:11:57 srvname kernel:  [<ffffffff8022caf2>] notify_change+0x145/0x2e0
Jul 22 04:11:57 srvname kernel:  [<ffffffff802ce6ae>] do_truncate+0x5e/0x79
Jul 22 04:11:57 srvname kernel:  [<ffffffff8020db96>] permission+0x81/0xc8
Jul 22 04:11:57 srvname kernel:  [<ffffffff80212b01>] may_open+0x1d3/0x22e
Jul 22 04:11:57 srvname kernel:  [<ffffffff8021b1c2>] open_namei+0x2c4/0x6d5
Jul 22 04:11:57 srvname kernel:  [<ffffffff802275c9>] do_filp_open+0x1c/0x38
Jul 22 04:11:57 srvname kernel:  [<ffffffff80219d14>] do_sys_open+0x44/0xbe
Jul 22 04:11:57 srvname kernel:  [<ffffffff8025f2f9>] tracesys+0xab/0xb6

Tail end of the gfs2_fsck run:

Ondisk and fsck bitmaps differ at block 1633350906 (0x615af4fa)
Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
Metadata type is 0 (free)
Succeeded.
Ondisk and fsck bitmaps differ at block 1633350907 (0x615af4fb)
Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
Metadata type is 0 (free)
Succeeded.
RG #1633288608 (0x615a01a0) free count inconsistent: is 48931 should be 49441
Resource group counts updated
Pass5 complete
Writing changes to disk
gfs2_fsck complete

PS.
Accessing this file in any of the following ways has caused the crash for sure:

mv file1 file2
rm -f file1
echo "asdf" > file1

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
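PPS. For anyone wanting to repeat the offline check described above, it amounted to roughly the following. This is a hedged sketch, not our exact commands: /dev/mapper/raid1 and /mnt/raid1 are placeholder names, and the dry-run guard is added here so the steps can be reviewed before anything destructive runs.

```shell
#!/bin/sh
# Sketch of the offline gfs2_fsck procedure. Placeholder device/mount names.
# With DRYRUN=1 (the default) the script only echoes what it would run.
DRYRUN=${DRYRUN:-1}

run() {
    if [ "$DRYRUN" = 1 ]; then
        echo "WOULD RUN: $*"
    else
        "$@"
    fi
}

# 1. Unmount the filesystem on every node (repeat this on each node);
#    gfs2_fsck must only run with the FS unmounted cluster-wide.
run umount /mnt/raid1

# 2. Run the checker from a single node. -y answers yes to all repair
#    prompts; on our 6.2TB FS this took about 9 hours.
run gfs2_fsck -y /dev/mapper/raid1

# 3. Remount on all nodes.
run mount /mnt/raid1
```

Set DRYRUN=0 only after confirming the device path on your own cluster.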