----- "Wendell Dingus" <wendell@xxxxxxxxxxxxx> wrote: | This recently happened to us and I'm wondering if there's anything | else we can do to prevent it and/or more fully recover from it. | | 3 physical nodes, 64-bit, 2.6.18-128.2.1.el5xen, 3 GFS2 filesystems | mounted by each. | | A zero-byte file in a subdir about 3 levels deep that when touched in | any way causes total meltdown. Details below... | | We took the filesystem offline (all nodes) and ran gfs2_fsck against | it. The FS is 6.2TB in size, living on 2gb/sec fibrechannel array. It | took 9 hours to complete which was not as bad as I had feared it would | be. Afterwards the filesystem was remounted and that zero-byte file | was attempted to be removed and the same thing happened again. So it | appears gfs2_fsck did not fix it. Since there are 3 GFS filesystems | the bad part was that access to all 3 went away when one of them had | an issue because GFS itself appears to have crashed. That's the part I | don't understand and am pretty sure was not what should have | happened. | | After a full reboot we renamed the directory holding the "bad" | zero-byte file to a directory in the root of that GFS filesystem and | are simply avoiding it at this point. | | Thanks... | | Description from a co-worker on what he found from researching this: | | While hitting something on the filesystem, it runs in to an invalid | metadata block, realizes the error & problem, and attempts to take the | FS offline because it's bad. (To not risk additional corruption) | | Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: fatal: | invalid metadata block | Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: bh = | 1633350398 (magic number) | Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: function = | gfs2_meta_indirect_buffer, file = | /builddir/build/BUILD/gfs2-kmod-1.92/_kmod_build_xen/meta_io.c, line = | 33 | 4 | Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: about to | withdraw this file system | Jul 22 04:11:48 srvname kernel: GFS2: fsid=cluname:raid1.1: telling LM | to withdraw | Jul 22 04:11:57 srvname kernel: GFS2: fsid=cluname:raid1.1: withdrawn | | Jul 22 04:11:57 srvname kernel: | | Unfortunately... For some reason, when it completes the withdrawal | process, gfs crashes... I'm sure it's not supposed to do that... It | should continue allowing access to all of the other GFS filesystems, | but since the gfs module is dieing, it kills access to any gfs | filesystems. 
|
| Jul 22 04:11:57 srvname kernel: Call Trace:
| Jul 22 04:11:57 srvname kernel: [<ffffffff8854891a>] :gfs2:gfs2_lm_withdraw+0xc1/0xd0
| Jul 22 04:11:57 srvname kernel: [<ffffffff80262907>] __wait_on_bit+0x60/0x6e
| Jul 22 04:11:57 srvname kernel: [<ffffffff80215788>] sync_buffer+0x0/0x3f
| Jul 22 04:11:57 srvname kernel: [<ffffffff80262981>] out_of_line_wait_on_bit+0x6c/0x78
| Jul 22 04:11:57 srvname kernel: [<ffffffff8029a01a>] wake_bit_function+0x0/0x23
| Jul 22 04:11:57 srvname kernel: [<ffffffff8021a7f1>] submit_bh+0x10a/0x111
| Jul 22 04:11:57 srvname kernel: [<ffffffff8855a627>] :gfs2:gfs2_meta_check_ii+0x2c/0x38
| Jul 22 04:11:57 srvname kernel: [<ffffffff8854c168>] :gfs2:gfs2_meta_indirect_buffer+0x104/0x160
| Jul 22 04:11:57 srvname kernel: [<ffffffff8853c786>] :gfs2:recursive_scan+0x96/0x175
| Jul 22 04:11:57 srvname kernel: [<ffffffff8853c82c>] :gfs2:recursive_scan+0x13c/0x175
| Jul 22 04:11:57 srvname kernel: [<ffffffff8853d65a>] :gfs2:do_strip+0x0/0x358
| Jul 22 04:11:57 srvname kernel: [<ffffffff802639f9>] _spin_lock_irqsave+0x9/0x14
| Jul 22 04:11:57 srvname kernel: [<ffffffff8853c8fe>] :gfs2:trunc_dealloc+0x99/0xe7
| Jul 22 04:11:57 srvname kernel: [<ffffffff8853d65a>] :gfs2:do_strip+0x0/0x358
| Jul 22 04:11:57 srvname kernel: [<ffffffff88545149>] :gfs2:gfs2_glock_dq+0x1e/0x132
| Jul 22 04:11:57 srvname kernel: [<ffffffff8020b7bf>] kfree+0x15/0xc5
| Jul 22 04:11:57 srvname kernel: [<ffffffff8853df97>] :gfs2:gfs2_truncatei+0x5e5/0x70d
| Jul 22 04:11:57 srvname kernel: [<ffffffff88545149>] :gfs2:gfs2_glock_dq+0x1e/0x132
| Jul 22 04:11:57 srvname kernel: [<ffffffff88544b28>] :gfs2:gfs2_glock_put+0x1a/0xe2
| Jul 22 04:11:57 srvname kernel: [<ffffffff88550b83>] :gfs2:gfs2_setattr+0xe6/0x335
| Jul 22 04:11:57 srvname kernel: [<ffffffff88550acd>] :gfs2:gfs2_setattr+0x30/0x335
| Jul 22 04:11:57 srvname kernel: [<ffffffff8026349f>] __down_write_nested+0x35/0x9a
| Jul 22 04:11:57 srvname kernel: [<ffffffff8022caf2>] notify_change+0x145/0x2e0
| Jul 22 04:11:57 srvname kernel: [<ffffffff802ce6ae>] do_truncate+0x5e/0x79
| Jul 22 04:11:57 srvname kernel: [<ffffffff8020db96>] permission+0x81/0xc8
| Jul 22 04:11:57 srvname kernel: [<ffffffff80212b01>] may_open+0x1d3/0x22e
| Jul 22 04:11:57 srvname kernel: [<ffffffff8021b1c2>] open_namei+0x2c4/0x6d5
| Jul 22 04:11:57 srvname kernel: [<ffffffff802275c9>] do_filp_open+0x1c/0x38
| Jul 22 04:11:57 srvname kernel: [<ffffffff80219d14>] do_sys_open+0x44/0xbe
| Jul 22 04:11:57 srvname kernel: [<ffffffff8025f2f9>] tracesys+0xab/0xb6
| Jul 22 04:11:57 srvname kernel:
|
| Tail end of the gfs2_fsck run:
| Ondisk and fsck bitmaps differ at block 1633350906 (0x615af4fa)
| Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
| Metadata type is 0 (free)
| Succeeded.
| Ondisk and fsck bitmaps differ at block 1633350907 (0x615af4fb)
| Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
| Metadata type is 0 (free)
| Succeeded.
| RG #1633288608 (0x615a01a0) free count inconsistent: is 48931 should
| be 49441
| Resource group counts updated
| Pass5 complete
| Writing changes to disk
| gfs2_fsck complete
|
| PS. Operations on this file that we know for sure have caused the
| crash: mv file1 file2, rm -f file1, and echo "asdf" >file1
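
(For reference, the offline check described above boils down to something
like the commands below; the mount point and device path are placeholders
for illustration, not taken from the original post:)

    # Unmount the affected GFS2 filesystem on every node in the cluster
    umount /mnt/gfs2_raid1               # placeholder mount point

    # Then, from a single node, check/repair the underlying block device.
    # -y answers yes to all repair prompts; run without it to confirm
    # each fix interactively.
    gfs2_fsck -y /dev/mapper/raid1_lun   # placeholder device path

As the post notes, a full pass over a 6.2TB filesystem can take many
hours, so plan a maintenance window accordingly.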
Hi Wendell,

The fsck.gfs2 program should fix this kind of error, but there are some
known fsck bugs. I've been working on a big fix for fsck.gfs and
fsck.gfs2 lately that solves many problems, and there is a chance it
will solve yours.

If you don't mind, perhaps you can send me a copy of your file system
metadata, and I will run my latest-and-greatest against it to see
whether it detects and fixes the problem. Even if it doesn't, perhaps I
can adjust my patch to fix your file system.

Regards,

Bob Peterson
Red Hat File Systems

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
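
(For anyone following along: the file system metadata Bob asks for above
is normally captured with gfs2_edit's savemeta mode from gfs2-utils,
which saves only metadata blocks, not file contents. The line below is a
sketch; the device path and output file name are placeholders:)

    # Save the GFS2 metadata (no user data) into a single file that can
    # be compressed and sent along; ideally run with the FS unmounted.
    gfs2_edit savemeta /dev/mapper/raid1_lun /tmp/raid1.metadata   # placeholder paths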