On Jul 24, 2018, at 9:00 AM, Jaco Kroon <jaco@xxxxxxxxx> wrote:
>
> Hi,
>
> Related to https://www.spinics.net/lists/linux-ext4/msg61075.html (and
> possibly the cause of the work from Jan in that patch series).
>
> I have a 64TB (exactly) filesystem.
>
> Filesystem OS type: Linux
> Inode count: 4294967295
> Block count: 17179869184
> Reserved block count: 689862348
> Free blocks: 16910075355
> Free inodes: 4294966285
> First block: 0
> Block size: 4096
> Fragment size: 4096
> Group descriptor size: 64
> Blocks per group: 32768
> Fragments per group: 32768
> Inodes per group: 8192
> Inode blocks per group: 512
> RAID stride: 128
> RAID stripe width: 128
> First meta block group: 1152
> Flex block group size: 16
>
> Note that in the above, Inode count == 2^32-1 instead of the expected 2^32.
>
> This results in the correct inode count being exactly 2^32 (which
> overflows to 0).  A kernel bug (fixed by Jan) allowed this overflow in
> the first place.
>
> I'm busy trying to write a patch for e2fsck (on top of the referenced
> series by Jan) that would allow fsck to at least clear the filesystem
> of other errors; currently, if I hack the inode count to ~0U, then
> fsck, tune2fs and friends fail.

Probably the easiest way to move forward here would be to use debugfs to
edit the superblock to reduce the block count by s_blocks_per_group and
the inode count by (s_inodes_per_group - 1), so that e2fsck doesn't
think you have that last group at all.

This assumes that you do not have any inodes allocated in the last
group, which is unlikely.  If you do, you could use "ncheck" to find the
names of those files and copy them to some other part of the filesystem
before editing the superblock.
> With the attached patch (sorry, Thunderbird breaks my inlining of
> patches) tune2fs operates (-l at least) as expected, and fsck gets to
> pass5, where it segfaults with the following stack trace (compiled
> with -O0):
>
> /dev/exp/exp contains a file system with errors, check forced.
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x00005555555ac8d1 in ext2fs_bg_flags_test (fs=0x555555811e90,
>     group=552320, bg_flag=1) at blknum.c:445
> 445             return gdp->bg_flags & bg_flag;
> (gdb) bt
> #0  0x00005555555ac8d1 in ext2fs_bg_flags_test (fs=0x555555811e90,
>     group=552320, bg_flag=1) at blknum.c:445
> #1  0x000055555558c343 in check_inode_bitmaps (ctx=0x5555558112b0) at pass5.c:759
> #2  0x000055555558a251 in e2fsck_pass5 (ctx=0x5555558112b0) at pass5.c:57
> #3  0x000055555556fb48 in e2fsck_run (ctx=0x5555558112b0) at e2fsck.c:249
> #4  0x000055555556e849 in main (argc=5, argv=0x7fffffffdfe8) at unix.c:1859
> (gdb) print *gdp
> $1 = {bg_block_bitmap = 528400, bg_inode_bitmap = 0,
>   bg_inode_table = 528456, bg_free_blocks_count = 0,
>   bg_free_inodes_count = 0, bg_used_dirs_count = 4000, bg_flags = 8,
>   bg_exclude_bitmap_lo = 0, bg_block_bitmap_csum_lo = 0,
>   bg_inode_bitmap_csum_lo = 8, bg_itable_unused = 0, bg_checksum = 0,
>   bg_block_bitmap_hi = 528344, bg_inode_bitmap_hi = 0,
>   bg_inode_table_hi = 528512, bg_free_blocks_count_hi = 0,
>   bg_free_inodes_count_hi = 0, bg_used_dirs_count_hi = 4280,
>   bg_itable_unused_hi = 8, bg_exclude_bitmap_hi = 0,
>   bg_block_bitmap_csum_hi = 0, bg_inode_bitmap_csum_hi = 0,
>   bg_reserved = 0}
>
> ... so I'm not sure why it even segfaults.  gdb can retrieve a value of
> 8 for bg_flags ... and yet, if the code does that, it segfaults.
> So not sure what the discrepancy is there - probably a misunderstanding
> of what's going wrong, but the only thing I can see that can segfault
> is the gdp dereference, and since that seems to be a valid pointer ...
>
> I am not sure if this is a separate issue, or due to me tampering with
> the inode counter in the way that I am (I have to assume the latter).
> For testing I created a thin volume (1TB) in a separate environment,
> where I created a 16TB filesystem initially, and then expanded that to
> 64TB, resulting in exactly the same symptoms we saw in the production
> environment.  I created a thousand empty files in the root folder.  The
> filesystem is currently consuming 100GB on-disk in the thin volume.
> Note that group=552320 > 524288 (17179869184 / 32768).

I was looking at the code in check_inode_bitmaps() and there are
definitely some risky areas in the loop handling.  The bitmap end and
bitmap start values are __u64, so they should be OK.  The loop counter
"i" is only a 32-bit value, so it may overflow with 2^32-1 inodes (or
really 2^32 inodes in the table, even if the last one is not used).

I think you need to figure out why the group counter has exceeded the
actual number of groups.  It is likely that the segfault is justified by
going beyond the end of the array, as there is no valid group data or
inode table for those groups.  One possibility is that if the last group
is marked EXT2_BG_INODE_UNINIT, then "i" will be incremented by
s_inodes_per_group, wrap around, and the loop condition will never
become false.  Converting "i" to a __u64 variable may solve the problem
in this code.

While I'm not against fixing this, I also suspect that there are other
parts of the code with similar problems, which may be dangerous if you
are resizing the filesystem.

> Regarding further expansion, I would appreciate some advice; there are
> two (three) possible options that I could come up with:
>
> 1.
> Find a way to reduce the number of inodes per group (say to 4096,
> which would require re-allocating all inodes >= 2^31 to inodes < 2^31).

Once you can properly access the filesystem (e.g. after editing the
superblock to shrink it by one group), then I believe Darrick added
support to resize2fs (or tune2fs) to reduce the number of inodes in the
filesystem.  The time needed to run this depends on how many inodes are
in use in the filesystem.  I'd strongly recommend making a backup of the
filesystem before such an operation, since a bug or an interruption
could leave you with a broken filesystem.

> 2. Allow additional blocks to be added to the filesystem, without
> adding additional inodes.
>
> (3. Find some free space, create a new filesystem, and iteratively
> move data from the one to the other, shrinking and growing the
> filesystems as progress is made - I will never be able to move more
> data than what is currently available on the system, around 4TB in my
> case, so this will take a VERY long time.)
>
> I'm currently aiming for option 2 since that looks to be the simplest:
> simply allow the overflow to happen, but don't allocate additional
> inodes if the number of inodes is already ~0U.

This would be a useful feature to have, especially since we allow
FLEX_BG to put all of the metadata at the start of the filesystem.  This
would essentially allow growing the block count independently of the
inode count, which can definitely be convenient in some cases.  It may
touch a lot of places in the code, but at least it would be pretty easy
to test, since it could be used on small filesystems.

Cheers, Andreas