Sunday Theodore Tso said:
On Sat, Oct 18, 2008 at 04:20:13PM -0700, Curtis Doty wrote:
4:29pm Theodore Tso said:
On Sat, Oct 18, 2008 at 12:55:56PM -0700, Curtis Doty wrote:
While attempting to expand a 1.64T ext4 volume to 2.18T the F9 kernel
deadlocked. (I have a photo of the screen/oops if anybody's interested.)
Yes, that would be useful, thanks.
Three photos of same: http://www.greenkey.net/~curtis/linux/
The rest had scrolled off, so maybe that soft lockup was a secondary
effect rather than the true cause? It was re-appearing every minute.
Looks like the kernel wedged due to running out of memory. The calls
to shrink_zone(), shrink_inactive_list(), try_to_release_page(),
etc. tend to indicate that the system was frantically trying to find
free physical memory at the time. It may or may not have been caused
by the online resize; how much memory does your system have, and what
else was going on at the time? It may have been that something *else*
had been leaking memory at the time, and this pushed it over the line.
The system had been up a couple of months and doing significant I/O on the ext4
volume. And indeed it had been having periodic memory/swap issues:
http://www.greenkey.net/~curtis/linux/cracker-kernel.2008-10-21
It's also the case that the online resize is journaled, so it should
have been safe; but I'm guessing that the system was thrashing so
hard, and you didn't have barriers enabled, that the filesystem
ended up getting corrupted.
Some other observations...
- a snapshot in a different vg blew up a few days prior; it was deleted
- ran vgs a few times in another vty during resize2fs *immediately*
before the crash
Hmm... This sounds like the needs_recovery flag was set on the backup
superblock, which should never happen. Before we try something more
extreme, see if this helps you:
e2fsck -b 32768 -B 4096 /dev/where-inst-is-located
That forces the use of the backup superblock right away, and might
help you get past the initial error.
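(As an aside, here is a minimal sketch of how one might read a single superblock copy off the device and report whether the needs_recovery flag is set on it. Purely illustrative: it assumes a 4K filesystem block size and a little-endian host, the device path and block number are just examples, and it only reads, never writes.)

/* sbcheck.c: read one ext3/ext4 superblock copy and report whether the
 * needs_recovery (INCOMPAT_RECOVER) flag is set.  Sketch only: assumes a
 * 4K filesystem block size and a little-endian host.
 *
 *   cc -o sbcheck sbcheck.c
 *   ./sbcheck /dev/dat/inst 32768
 */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK_SIZE        4096      /* filesystem block size (assumed) */
#define EXT2_SUPER_MAGIC  0xEF53
#define INCOMPAT_RECOVER  0x0004    /* EXT3_FEATURE_INCOMPAT_RECOVER */

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <device> <superblock-block-number>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    unsigned long long blk = strtoull(argv[2], NULL, 0);
    /* With 4K blocks, backup superblocks sit at the start of their block;
     * only the primary copy lives at byte offset 1024 within block 0. */
    off_t off = (off_t)blk * BLOCK_SIZE + (blk == 0 ? 1024 : 0);

    unsigned char sb[1024];
    if (pread(fd, sb, sizeof(sb), off) != (ssize_t)sizeof(sb)) {
        perror("pread"); return 1;
    }

    uint16_t magic    = (uint16_t)(sb[56] | (sb[57] << 8));       /* s_magic */
    uint32_t incompat = sb[96] | (sb[97] << 8) | (sb[98] << 16)   /* s_feature_incompat */
                        | ((uint32_t)sb[99] << 24);

    if (magic != EXT2_SUPER_MAGIC) {
        fprintf(stderr, "block %llu: no superblock magic found\n", blk);
        return 1;
    }
    printf("block %llu: feature_incompat=0x%x needs_recovery=%s\n", blk,
           incompat, (incompat & INCOMPAT_RECOVER) ? "yes" : "no");
    close(fd);
    return 0;
}

Running it once per backup location listed by mke2fs -n would show whether every copy really has the flag set.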
Same as before. :-(
# e2fsck -b32768 -B4096 -C0 /dev/dat/inst
e2fsck 1.41.0 (10-Jul-2008)
inst: recovering journal
e2fsck: unable to set superblock flags on inst
It appears *all* superblocks are the same as that first one at 32768;
iterating over all of the backup superblocks shown in the mkfs -n output
says so.
I'm inclined to just force-reduce the underlying LVM. It was 100% full
before I extended and tried to resize. And I know the only writes on the
new LVM extent would have been from resize2fs. Is that wise?
No, force reducing the underlying LVM is only going to make things
worse, since it doesn't fix the filesystem.
So this is what I would do. Create a snapshot and try this on the
snapshot first:
% lvcreate -s -L 10G -n inst-snapshot /dev/dat/inst
% debugfs -w /dev/dat/inst-snapshot
debugfs: features ^needs_recovery
debugfs: quit
% e2fsck -C 0 /dev/dat/inst-snapshot
Done, but no change. :-(
EXT4-fs: ext4_check_descriptors: Block bitmap for group 13413 not in group (block 0)!
EXT4-fs: group descriptors corrupted!
This will skip running the journal, but there's no guarantee the
journal is valid anyway.
If this turns into a mess, you can throw away the snapshot and try
something else. (The something else would require writing a C program
that removes the needs_recovery flag from all the backup superblocks, but
keeps it set on the master superblock. That's more work, so let's
try this way first.)
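(Purely for illustration, and not Ted's actual code: a rough sketch of what such a program might look like, assuming a 4K filesystem block size and a little-endian host. It takes the backup superblock locations from the mke2fs -n output on the command line, clears the flag only on those copies, and leaves block 0 untouched. Since it writes to the device, it would be best tried against an LVM snapshot first.)

/* clear_backup_recover.c: clear the needs_recovery (INCOMPAT_RECOVER)
 * flag on the backup superblocks named on the command line, leaving the
 * primary superblock in block group 0 alone.  Sketch only: assumes a 4K
 * filesystem block size and a little-endian host.
 *
 *   cc -o clear_backup_recover clear_backup_recover.c
 *   ./clear_backup_recover /dev/dat/inst-snapshot 32768 98304 163840 ...
 */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK_SIZE        4096
#define EXT2_SUPER_MAGIC  0xEF53
#define INCOMPAT_RECOVER  0x0004   /* EXT3_FEATURE_INCOMPAT_RECOVER */

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <device> <backup-sb-block>...\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 2; i < argc; i++) {
        unsigned long long blk = strtoull(argv[i], NULL, 0);
        /* Backup superblocks start on a block boundary when blocksize > 1K. */
        off_t off = (off_t)blk * BLOCK_SIZE;
        unsigned char sb[1024];

        if (pread(fd, sb, sizeof(sb), off) != (ssize_t)sizeof(sb)) {
            perror("pread"); return 1;
        }
        uint16_t magic = (uint16_t)(sb[56] | (sb[57] << 8));   /* s_magic */
        if (magic != EXT2_SUPER_MAGIC) {
            fprintf(stderr, "block %llu: no superblock magic, skipping\n", blk);
            continue;
        }
        /* INCOMPAT_RECOVER lives in the low byte of s_feature_incompat (offset 96). */
        if (!(sb[96] & INCOMPAT_RECOVER)) {
            printf("block %llu: needs_recovery already clear\n", blk);
            continue;
        }
        sb[96] = (unsigned char)(sb[96] & ~INCOMPAT_RECOVER);
        if (pwrite(fd, sb, sizeof(sb), off) != (ssize_t)sizeof(sb)) {
            perror("pwrite"); return 1;
        }
        printf("block %llu: cleared needs_recovery\n", blk);
    }
    close(fd);
    return 0;
}

If e2fsck on the snapshot then behaves sanely, the same steps could presumably be repeated on the real device.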
How does that something else work?
../C
_______________________________________________
Ext3-users mailing list
Ext3-users@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/ext3-users