Re: EXT4-fs: group descriptors corrupted!

Greg Freemyer <greg.freemyer@xxxxxxxxx> · Wed, 25 Feb 2009 18:41:42 -0500

Smart ass comment about the new ATA spec intentionally top-posted.

Question:  How do you know those sectors did not somehow get
discarded, then modified behind the scenes by an SSD, then fixated to
new deterministic values by a read?

Answer: Because devices that do that aren't shipping yet.

Damn the future looks good from here.

On Wed, Feb 25, 2009 at 6:18 PM, Theodore Tso <tytso@xxxxxxx> wrote:
> Huh.  OK, there's something really strange going on here.
>
> The kernel never updates the backup superblock; that's by design, to
> avoid corruption problems.  So for example, on my laptop, if I run
> dumpe2fs on my root partition, I see this:
>
> Filesystem created:       Fri Feb 13 09:00:02 2009
> Last mount time:          Tue Feb 24 14:34:19 2009
> Last write time:          Tue Feb 24 14:34:19 2009
> Mount count:              3
> Maximum mount count:      30
> Last checked:             Sat Feb 14 10:46:41 2009
> Check interval:           15552000 (6 months)
> Next check after:         Thu Aug 13 11:46:41 2009
>
> However, if I run dumpe2fs -o superblock=32768 on my root partition,
> I'll see this:
>
> Filesystem created:       Fri Feb 13 09:00:02 2009
> Last mount time:          Fri Feb 13 11:22:06 2009
> Last write time:          Sat Feb 14 10:47:11 2009
> Mount count:              0
> Maximum mount count:      30
> Last checked:             Sat Feb 14 10:46:41 2009
> Check interval:           15552000 (6 months)
> Next check after:         Thu Aug 13 11:46:41 2009
>
> Note the difference in the "last write time" and the "last mount
> time".  That's because normally we avoid touching the backup
> superblocks.
>
> Now let's take a look at your dumpe2fs output.  In your case, we see
> the following:
>
> Filesystem created:       Thu Jan 22 19:33:20 2009
> Last mount time:          Fri Jan 23 16:23:58 2009
> Last write time:          Sun Feb 22 02:31:02 2009
> Mount count:              1
> Maximum mount count:      24
> Last checked:             Fri Jan 23 16:19:49 2009
> Check interval:           15552000 (6 months)
> Next check after:         Wed Jul 22 17:19:49 2009
>
> and it's the same on both the primary and backup (dumpe2fs -o
> superblock=32768).  The question is how the heck did *that* happen?
> As I mentioned, the kernel doesn't even have code to touch the backup
> superblock.  That would tend to implicate one of the e2fsprogs tools,
> or something using the e2fsprogs libraries --- but the recent
> libraries (and you're using e2fsprogs 1.41.x) also avoid touching the
> backup superblocks.  The only tools that could have done it from
> e2fsprogs userland are e2fsck, tune2fs, and resize2fs, and that
> doesn't explain how the values turned out to be pure garbage.
>
> Does the "last write" timestamp suggest anything to you?  What
> was happening on the system at or around Sun Feb 22 02:31:02 2009?
> Maybe if we can localize this down to what userspace program caused
> the problem, it'll be a hint.
>
> (This is why I didn't want you to run e2fsck just yet; if you had, it
> would have overwritten the last write time, which could be a valuable
> clue as to what is causing this problem.)
>
> As far as how to recover your data, what I would recommend doing is
> creating a writeable LVM snapshot, with a pretty good amount of space.
> Then try running the command "mke2fs -S " on the snapshot, with
> *precisely* the same mke2fs arguments and /etc/mke2fs.conf that you
> used to create the filesystem in the first place.  Then cross your
> fingers, and run e2fsck on the snapshot, and see how much of the data you
> can recover; some of it may end up in lost+found, but hopefully you'll
> get most of the data back.  If it works on the snapshot, only then try it
> on the real LVM.  If it doesn't work out on the snapshot, you can
> always discard it and try again without further corrupting any of your
> original filesystem.
>
> Good luck, and thanks in advance for any information you can give
> us to help track down this problem.  At this point I'm going to guess
> that it's a nasty e2fsprogs bug, where somehow the internal in-memory
> version of the block group descriptors got corrupted, and then got
> written out to disk.  But this is just a guess at this point --- and
> I'm still left wondering why I haven't seen it on my systems and on my
> regression testing.
>
>                                            - Ted
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
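
For anyone wanting to repeat Ted's primary-versus-backup comparison on
their own dumpe2fs output, here is a rough Python sketch.  The function
names and the choice of "volatile" fields are my own assumptions, not
anything from e2fsprogs, and the sample text is simply copied from the
quoted message.  It is only a heuristic: a backup superblock that
matches the primary on these fields is suspicious, not proof of a bug.

```python
# Heuristic sketch: compare "Key:   value" header lines from two dumpe2fs
# runs (primary superblock vs. "-o superblock=32768").
# All names here are illustrative assumptions, not an e2fsprogs API.

# Fields the kernel updates on the primary superblock during normal use.
VOLATILE_FIELDS = ("Last mount time", "Last write time", "Mount count")


def parse_superblock_fields(text):
    """Turn dumpe2fs-style header text into a {field: value} dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only; timestamps contain more colons.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def differing_fields(primary, backup):
    """Return the volatile fields whose values differ between the two."""
    return [k for k in VOLATILE_FIELDS if primary.get(k) != backup.get(k)]


def backup_looks_untouched(primary, backup):
    """True if at least one volatile field differs, as on a healthy,
    in-use filesystem where the backup superblock is left alone."""
    return bool(differing_fields(primary, backup))


# Sample values copied from the quoted message (Ted's laptop).
LAPTOP_PRIMARY = """\
Last mount time:          Tue Feb 24 14:34:19 2009
Last write time:          Tue Feb 24 14:34:19 2009
Mount count:              3
"""
LAPTOP_BACKUP = """\
Last mount time:          Fri Feb 13 11:22:06 2009
Last write time:          Sat Feb 14 10:47:11 2009
Mount count:              0
"""
# The reporter's filesystem: primary and backup were identical.
REPORTER_BOTH = """\
Last mount time:          Fri Jan 23 16:23:58 2009
Last write time:          Sun Feb 22 02:31:02 2009
Mount count:              1
"""

if __name__ == "__main__":
    laptop = (parse_superblock_fields(LAPTOP_PRIMARY),
              parse_superblock_fields(LAPTOP_BACKUP))
    reporter = parse_superblock_fields(REPORTER_BOTH)
    print("laptop differs in:", differing_fields(*laptop))
    print("reporter backup untouched?",
          backup_looks_untouched(reporter, reporter))
```

A freshly created or freshly checked filesystem can legitimately show
matching values, so treat a match as a prompt to look closer rather
than as a diagnosis.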

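The recovery procedure Ted describes (writable snapshot, then mke2fs -S
with the original arguments, then e2fsck, all on the snapshot first)
can be laid out as a dry-run plan.  This Python sketch only builds and
prints the commands; it runs nothing.  The device path /dev/vg0/data,
the snapshot name, the 20G size, and the "-t ext4" arguments are
placeholders I made up, and the "-f" on e2fsck is my addition.  The
real mke2fs arguments must be exactly the ones used when the filesystem
was created.

```python
# Dry-run sketch of the snapshot-first recovery steps from the quoted mail.
# Nothing is executed; the commands are only assembled for review.
# Paths, names, sizes, and mke2fs arguments below are hypothetical.
import shlex


def recovery_plan(origin_lv, snap_name, snap_size, mke2fs_args):
    """Return the command lists for a snapshot-based recovery attempt.

    origin_lv    -- path of the damaged logical volume
    snap_name    -- name for the new writable snapshot
    snap_size    -- copy-on-write space for the snapshot (be generous)
    mke2fs_args  -- the exact arguments used at filesystem creation
    """
    # The snapshot appears next to the origin in the same volume group.
    vg_dir = origin_lv.rsplit("/", 1)[0]
    snap_dev = f"{vg_dir}/{snap_name}"
    return [
        # 1. Writable snapshot, so the original is never modified.
        ["lvcreate", "--snapshot", "--size", snap_size,
         "--name", snap_name, origin_lv],
        # 2. Rewrite only the superblock and group descriptors.
        ["mke2fs", "-S", *mke2fs_args, snap_dev],
        # 3. Forced check to rebuild what can be rebuilt.
        ["e2fsck", "-f", snap_dev],
    ]


if __name__ == "__main__":
    plan = recovery_plan("/dev/vg0/data", "data_rescue", "20G",
                         ["-t", "ext4"])
    for cmd in plan:
        print(shlex.join(cmd))
```

If the attempt on the snapshot goes badly, the snapshot can be removed
and the original volume is no worse off than before.
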
-- 
Greg Freemyer
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com