Re: resize2fs on ext4 leads to corruption

"Theodore Ts'o" <tytso@xxxxxxx> · Fri, 15 Apr 2022 17:22:51 -0400

On Sat, Apr 16, 2022 at 12:37:29AM +1000, Peter Urbanec wrote:
> # e2fsck -f -C 0 -b 32768 -z /root/20220415_2015_e2fsck-b_32768.e2undo
> /dev/md0
> e2fsck 1.46.4 (18-Aug-2021)
> Overwriting existing filesystem; this can be undone using the command:
>      e2undo /root/20220415_2015_e2fsck-b_32768.e2undo /dev/md0
> 
> e2fsck: Undo file corrupt while trying to open /dev/md0
> 
> The superblock could not be read or does not describe a valid ext2/ext3/ext4
> filesystem.  If the device is valid and it really contains an ext2/ext3/ext4
> filesystem (and not swap or ufs or something else), then the superblock
> is corrupt, and you might try running e2fsck with an alternate superblock:
>      e2fsck -b 8193 <device>
>   or
>      e2fsck -b 32768 <device>

So the failure of "e2fsck -f -C 0 -b 32768 -z /root/e2fsck.e2undo
/dev/md0" appears to be a bug where e2fsck doesn't work correctly with
an undo file when using a backup superblock.  I can replicate this
using these commands:

	mke2fs -q -t ext4 /tmp/foo.img 2G
	e2fsck -b 32768 -z /tmp/undo /tmp/foo.img

Running e2fsck without the -z option succeeds.  The combination of the
-b and -z option seems to be broken.  As a workaround, I would suggest
doing is to try running e2fsck with -n, which will open the block
device read-only, e.g. "e2fsck -b 32768 -n /dev/mdXX".  If the changes
e2fsck look safe, then you can run e2fsck without the -n option.

As far as what happens, I wasn't able to replicate the problem when
resizing an empty file system from 3906946560 to 5860419840 blocks,
using a 64-bit binary.  I've stopped testing 32-bit builds quite a
while ago, and filling the file system would take more time than I
have at the moment.  I will note that 3906946560 fits in 32-bits,
while 5860419840 is larger than 2**32.  So it could very much be some
kind of a 32-bit block number overflow.

For better or for worse, I don't have the resources or time to test
the full set of combinations of file system features, and as I
mentioned, I don't test 32-bit builds any more, either.  Enterprise
distributions will provide paid support, but they tend to only support
a limited number of file system features, and not the full set of
combination of features.

While I'm grateful that Gentoo users seem to be super adventurous in
terms of turning every single feature they can find, and then send in
bug reports so we can improve the case base --- there are reasons why
features like sparse_super2 and inline_data are not enabled by
default, and I feel bad when people turn on non-standard features, and
then lose data because they aren't doing backups and they enable
features that haven't received as much testing as the default
features.

So I don't know if the problemw as due to some kind of bug in
resize2fs caused by the use of 64-bit block numbers on a 32-bit
binary, or due to the enablement of the sparse_super2 feature.  But
for now, let's see if we can get recover your file system, hopefully
with minimal data loss.  And since the using an undo file seems to be
problematic with running e2fsck -b, the alternatives are (a) do a full
block level backup of the file system before running e2fsck --- which
I know will be hard since this is a very large file system, and (b)
using e2fsck -n first and looking at what e2fsck would do first.

I will look at why "e2fsck -b 32768 -z /tmp/foo.undo /tmp/foo.img"
fails, but I may not get to it in a week or two.

Cheers,

						- Ted