On 3/29/11 5:26 PM, Daniel Taylor wrote:
> Thanks for the suggestions.  Tao Ma's got me started, but doing some
> of the more "devious" tests is on my list, too.
>
> The original issue was that during component stress testing, we were
> seeing instances of the ext4 file system becoming "read-only" (showing
> in /proc/mounts, but not "mount").  Looking back through the logs, we
> saw that at mount time, there was a complaint about a corrupted journal.

So, did it go "read-only" right at mount time due to a journal replay
failure?  Or ...

> Some writing had occurred before the change to read-only, however.

That makes it sound like it did get mounted ok ... and then something
went wrong?  What did the logs say?

> The original mount script didn't check for any "mount" return value, so
> we theorized that ext4 just got to a point where it couldn't sensibly
> handle any more changes.

I'm not sure what that means, TBH :)

Just want to make sure you're barking up the right tree, here ...

-Eric

> It seemed that the right answer was to check the return value from mount
> and, if non-0, umount the file system, fix it, and try again.  To test
> the return value from mount, I need to be able to corrupt, but not
> destroy, the journal, since the component tests were taking days to show
> the failure.
>
> Running an "fsck -f" every time on a 3TB file system with an embedded
> PPC was just taking too much time to impose on a consumer-level customer.
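
For what it's worth, a minimal sketch of that "check mount, repair,
retry" path might look something like the below.  This is just my
reading of the approach you described, not your actual script; the
device and mountpoint names are placeholders:

    #!/bin/sh
    # Rough sketch: mount, and on failure try a repair and one retry.
    DEV=/dev/sda1        # placeholder device
    MNT=/mnt/data        # placeholder mountpoint

    if ! mount -t ext4 "$DEV" "$MNT"; then
        # Mount failed (e.g. journal problem); make sure nothing is
        # half-mounted, repair, and try once more.
        umount "$MNT" 2>/dev/null
        e2fsck -p "$DEV" || e2fsck -y "$DEV"
        mount -t ext4 "$DEV" "$MNT"
    fi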
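
And for the "corrupt but don't destroy the journal" part, one crude way
to provoke a journal complaint at mount time might be to scribble over a
single block inside the internal journal of an *unmounted* test
filesystem.  This is purely a guess at a test rig, with an internal
journal at inode <8> and a 4k block size assumed:

    #!/bin/sh
    # Hypothetical journal-corruption helper for an UNMOUNTED test fs.
    DEV=/dev/sdb1        # placeholder test device

    # List the blocks used by the journal inode and pick one of them
    # (here, arbitrarily, the 20th).
    BLK=$(debugfs -R "blocks <8>" "$DEV" 2>/dev/null | tr ' ' '\n' | sed -n '20p')

    # Overwrite that one block with garbage (assumes 4k fs blocks).
    dd if=/dev/urandom of="$DEV" bs=4096 seek="$BLK" count=1 conv=notrunc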