Re: fsck.ext4: Group descriptors look bad... trying backup blocks...

Eric Sandeen <sandeen@xxxxxxxxxx> · Mon, 20 Apr 2009 09:49:59 -0500

Theodore Tso wrote:
> On Mon, Apr 20, 2009 at 12:43:37PM +0100, Jeremy Sanders wrote:
>> It takes a day or two to do the sync. I've only done it twice (once with
>> the old kernel, once with the new fedora testing kernel) and it happened
>> both times. I'm afraid the sample size is rather small here.
>>
>> I did a different faster test (just copying my home directory lots of  
>> times), but I wasn't able to get it to fail. That test didn't use much  
>> disk space, however. Maybe it's worth just dd'ing a few TB of data onto  
>> the device and seeing whether that fails.
>>
>> I didn't reboot this time - I did last time. I just unmounted the file
>> system and fsck'd it. The filesystem is 8.2TB and the data is around
>> 2.5TB.

I think trying a filesystem with just under 8T would be a useful test too.
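
A quick back-of-the-envelope sketch of why the 8T mark is interesting. It
assumes 4 KiB blocks (not confirmed anywhere in this thread) and that the
sizes quoted are in binary units; the 32-bit overflow angle is a guess, not
a diagnosis:

```python
# Assumption: 4 KiB block size (common for large filesystems, but not
# confirmed for this particular one).
BLOCK_SIZE = 4096
TIB = 2 ** 40

def block_number(byte_offset, block_size=BLOCK_SIZE):
    """Block number that contains the given byte offset."""
    return byte_offset // block_size

def fits_signed_32(n):
    """True if n can be held in a signed 32-bit integer."""
    return -(2 ** 31) <= n < 2 ** 31

just_under = block_number(8 * TIB - 1)   # last block below 8 TiB
at_boundary = block_number(8 * TIB)      # first block at 8 TiB

print(just_under, fits_signed_32(just_under))
print(at_boundary, fits_signed_32(at_boundary))
```

Under those assumptions, a filesystem just under 8T never hands out a block
number that needs the 32nd bit, so comparing it against the 8.2TB case might
help narrow things down.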

> That's useful data.  I wish we could make it fail more quickly
> on a smaller rsync, but the fact that you didn't need to reboot is
> definitely useful information.
> 
> And this is a fresh rsync, so no files were being deleted; rsync should
> have just been writing new files to .filename.XXXXX and then renaming
> .filename.XXXXX to filename when it is done, right?
> 
> OK, let me think about this a little.  I think we can create a patch
> which checks for writes to the block group descriptors and dumps a
> stack trace.  That would allow us to catch the failing code in question
> in the act, and maybe figure out what is going on.

XFS has block-zero tests, because there was once a bug where
uninitialized block numbers in buffers were clobbering the superblock at
block 0.  It was helpful, so I think this is a good idea, Ted.
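
The general shape of such a check, modeled in user-space Python rather than
kernel C. This is only a sketch: `PROTECTED_RANGES` and `check_write` are
hypothetical names, the block numbers are made up, and it is not the XFS
code or the patch being proposed:

```python
import traceback

# Hypothetical model: inclusive block ranges that ordinary data writes
# should never touch, e.g. wherever the group descriptors live. The
# numbers here are made up for illustration.
PROTECTED_RANGES = [(1, 2)]

def check_write(start_block, nr_blocks, protected=PROTECTED_RANGES):
    """Return a stack trace string if the write overlaps a protected
    range, else None. Assumes nr_blocks >= 1."""
    end_block = start_block + nr_blocks - 1
    for lo, hi in protected:
        # Two inclusive ranges overlap when each starts before the
        # other one ends.
        if start_block <= hi and end_block >= lo:
            return "".join(traceback.format_stack())
    return None
```

Calling `check_write(100, 8)` returns None, while `check_write(0, 2)` returns
the call stack of whoever issued the write, which is the information needed
to catch the offender in the act.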

-Eric