----- "Wendell Dingus" <wendell@xxxxxxxxxxxxx> wrote: | This recently happened to us and I'm wondering if there's anything | else we can do to prevent it and/or more fully recover from it. | | 3 physical nodes, 64-bit, 2.6.18-128.2.1.el5xen, 3 GFS2 filesystems | mounted by each. | | A zero-byte file in a subdir about 3 levels deep that when touched in | any way causes total meltdown. Details below... | | We took the filesystem offline (all nodes) and ran gfs2_fsck against | it. The FS is 6.2TB in size, living on 2gb/sec fibrechannel array. It | took 9 hours to complete which was not as bad as I had feared it would | be. Afterwards the filesystem was remounted and that zero-byte file | was attempted to be removed and the same thing happened again. So it | appears gfs2_fsck did not fix it. Since there are 3 GFS filesystems | the bad part was that access to all 3 went away when one of them had | an issue because GFS itself appears to have crashed. That's the part I | don't understand and am pretty sure was not what should have | happened. | | After a full reboot we renamed the directory holding the "bad" | zero-byte file to a directory in the root of that GFS filesystem and | are simply avoiding it at this point. | | Thanks... | | Description from a co-worker on what he found from researching this: | | While hitting something on the filesystem, it runs in to an invalid | metadata block, realizes the error & problem, and attempts to take the | FS offline because it's bad. (To not risk additional corruption) | | Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: fatal: | invalid metadata block | Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: bh = | 1633350398 (magic number) | Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: function = | gfs2_meta_indirect_buffer, file = | /builddir/build/BUILD/gfs2-kmod-1.92/_kmod_build_xen/meta_io.c, line = | 33 | 4 | Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: about to | withdraw this file system | Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: telling LM | to withdraw | Jul 22 04:11:57 srvname kernel: GFS2: fsid=cluname:raid1.1: withdrawn | | Jul 22 04:11:57 srvname kernel: | | Unfortunately... For some reason, when it completes the withdrawal | process, gfs crashes... I'm sure it's not supposed to do that... It | should continue allowing access to all of the other GFS filesystems, | but since the gfs module is dieing, it kills access to any gfs | filesystems. 
|
| Jul 22 04:11:57 srvname kernel: Call Trace:
| Jul 22 04:11:57 srvname kernel: [<ffffffff8854891a>] :gfs2:gfs2_lm_withdraw+0xc1/0xd0
| Jul 22 04:11:57 srvname kernel: [<ffffffff80262907>] __wait_on_bit+0x60/0x6e
| Jul 22 04:11:57 srvname kernel: [<ffffffff80215788>] sync_buffer+0x0/0x3f
| Jul 22 04:11:57 srvname kernel: [<ffffffff80262981>] out_of_line_wait_on_bit+0x6c/0x78
| Jul 22 04:11:57 srvname kernel: [<ffffffff8029a01a>] wake_bit_function+0x0/0x23
| Jul 22 04:11:57 srvname kernel: [<ffffffff8021a7f1>] submit_bh+0x10a/0x111
| Jul 22 04:11:57 srvname kernel: [<ffffffff8855a627>] :gfs2:gfs2_meta_check_ii+0x2c/0x38
| Jul 22 04:11:57 srvname kernel: [<ffffffff8854c168>] :gfs2:gfs2_meta_indirect_buffer+0x104/0x160
| Jul 22 04:11:57 srvname kernel: [<ffffffff8853c786>] :gfs2:recursive_scan+0x96/0x175
| Jul 22 04:11:57 srvname kernel: [<ffffffff8853c82c>] :gfs2:recursive_scan+0x13c/0x175
| Jul 22 04:11:57 srvname kernel: [<ffffffff8853d65a>] :gfs2:do_strip+0x0/0x358
| Jul 22 04:11:57 srvname kernel: [<ffffffff802639f9>] _spin_lock_irqsave+0x9/0x14
| Jul 22 04:11:57 srvname kernel: [<ffffffff8853c8fe>] :gfs2:trunc_dealloc+0x99/0xe7
| Jul 22 04:11:57 srvname kernel: [<ffffffff8853d65a>] :gfs2:do_strip+0x0/0x358
| Jul 22 04:11:57 srvname kernel: [<ffffffff88545149>] :gfs2:gfs2_glock_dq+0x1e/0x132
| Jul 22 04:11:57 srvname kernel: [<ffffffff8020b7bf>] kfree+0x15/0xc5
| Jul 22 04:11:57 srvname kernel: [<ffffffff8853df97>] :gfs2:gfs2_truncatei+0x5e5/0x70d
| Jul 22 04:11:57 srvname kernel: [<ffffffff88545149>] :gfs2:gfs2_glock_dq+0x1e/0x132
| Jul 22 04:11:57 srvname kernel: [<ffffffff88544b28>] :gfs2:gfs2_glock_put+0x1a/0xe2
| Jul 22 04:11:57 srvname kernel: [<ffffffff88550b83>] :gfs2:gfs2_setattr+0xe6/0x335
| Jul 22 04:11:57 srvname kernel: [<ffffffff88550acd>] :gfs2:gfs2_setattr+0x30/0x335
| Jul 22 04:11:57 srvname kernel: [<ffffffff8026349f>] __down_write_nested+0x35/0x9a
| Jul 22 04:11:57 srvname kernel: [<ffffffff8022caf2>] notify_change+0x145/0x2e0
| Jul 22 04:11:57 srvname kernel: [<ffffffff802ce6ae>] do_truncate+0x5e/0x79
| Jul 22 04:11:57 srvname kernel: [<ffffffff8020db96>] permission+0x81/0xc8
| Jul 22 04:11:57 srvname kernel: [<ffffffff80212b01>] may_open+0x1d3/0x22e
| Jul 22 04:11:57 srvname kernel: [<ffffffff8021b1c2>] open_namei+0x2c4/0x6d5
| Jul 22 04:11:57 srvname kernel: [<ffffffff802275c9>] do_filp_open+0x1c/0x38
| Jul 22 04:11:57 srvname kernel: [<ffffffff80219d14>] do_sys_open+0x44/0xbe
| Jul 22 04:11:57 srvname kernel: [<ffffffff8025f2f9>] tracesys+0xab/0xb6
| Jul 22 04:11:57 srvname kernel:
|
| Tail end of the gfs2_fsck run:
| Ondisk and fsck bitmaps differ at block 1633350906 (0x615af4fa)
| Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
| Metadata type is 0 (free)
| Succeeded.
| Ondisk and fsck bitmaps differ at block 1633350907 (0x615af4fb)
| Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
| Metadata type is 0 (free)
| Succeeded.
| RG #1633288608 (0x615a01a0) free count inconsistent: is 48931 should
| be 49441
| Resource group counts updated
| Pass5 complete
| Writing changes to disk
| gfs2_fsck complete
|
| PS. Operations on this file that we know for sure have caused the
| crash: mv file1 file2, rm -f file1, and echo "asdf" >file1
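
(For reference, the offline check described above boils down to something
like the commands below; the mount point and device path are placeholders
for illustration, not taken from the original post:)

    # Unmount the affected GFS2 filesystem on every node in the cluster
    umount /mnt/gfs2_raid1               # placeholder mount point

    # Then, from a single node, check/repair the underlying block device.
    # -y answers yes to all repair prompts; run without it to confirm
    # each fix interactively.
    gfs2_fsck -y /dev/mapper/raid1_lun   # placeholder device path

As the post notes, a full pass over a 6.2TB filesystem can take many
hours, so plan a maintenance window accordingly.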
Hi Wendell,

The fsck.gfs2 program should fix this kind of error, but there are some
known fsck bugs. I've been working on a big fix for fsck.gfs and
fsck.gfs2 lately that solves many problems, and there is a chance it
will solve yours.

If you don't mind, perhaps you can send me a copy of your file system
metadata, and I will run my latest-and-greatest against it to see
whether it detects and fixes the problem. Even if it doesn't, perhaps I
can adjust my patch to fix your file system.

Regards,

Bob Peterson
Red Hat File Systems

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
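
(For anyone following along: the file system metadata Bob asks for above
is normally captured with gfs2_edit's savemeta mode from gfs2-utils,
which saves only metadata blocks, not file contents. The line below is a
sketch; the device path and output file name are placeholders:)

    # Save the GFS2 metadata (no user data) into a single file that can
    # be compressed and sent along; ideally run with the FS unmounted.
    gfs2_edit savemeta /dev/mapper/raid1_lun /tmp/raid1.metadata   # placeholder paths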