Re: allowing ext4 file systems that wrapped inode count to continue working

Hi Andreas, Ted,

Ted, you mostly just expanded on Andreas's information regarding reducing the filesystem to a "sane" state, specifically the useful information on dropping the last group.  That may well come in useful.  Whilst I'm working in a test environment at the moment, my real problem is this:

# df -m /home
Filesystem           1M-blocks     Used Available Use% Mounted on
/dev/mapper/lvm-home  66055848 65023779   1032053  99% /home

I really need to further expand that filesystem.  I can take it offline for a few hours or so if there are no other options, but that's not ideal (even getting an opportunity to run umount when nothing is accessing it is scarce).  The VG on which it's contained does have 4.5TB available for expansion, I just don't want to allocate that anywhere until I have a known working strategy.

I'll respond to Andreas's information below.  Please do keep in mind that whilst I'm a long-time Linux user (nearly 20 years) with a sensible amount of development experience, I'm by no means a filesystem (not to mention ext4) expert, and I may well misinterpret some of the information that's available and bark up wrong trees here.

On 24/07/2018 18:33, Andreas Dilger wrote:
> On Jul 24, 2018, at 9:00 AM, Jaco Kroon <jaco@xxxxxxxxx> wrote:
>>
>> Hi,
>>
>> Related to https://www.spinics.net/lists/linux-ext4/msg61075.html (and
>> possibly the cause of the work from Jan in that patch series).
>>
>> I have a 64TB (exactly) filesystem.
>>
>> Filesystem OS type:       Linux
>> Inode count:              4294967295
>> Block count:              17179869184
>> Reserved block count:     689862348
>> Free blocks:              16910075355
>> Free inodes:              4294966285
>> First block:              0
>> Block size:               4096
>> Fragment size:            4096
>> Group descriptor size:    64
>> Blocks per group:         32768
>> Fragments per group:      32768
>> Inodes per group:         8192
>> Inode blocks per group:   512
>> RAID stride:              128
>> RAID stripe width:        128
>> First meta block group:   1152
>> Flex block group size:    16
>>
>> Note that in the above Inode count == 2^32-1 instead of the expected 2^32.
>>
>> This results in the correct inode count being exactly 2^32 (which
>> overflows to 0).  A kernel bug (fixed by Jan) allowed this overflow in
>> the first place.
>>
>> I'm busy trying to write a patch for e2fsck that would allow it to (on
>> top of the referenced series by Jan) enable fsck to at least clear the
>> filesystem from other errors, where currently if I hack the inode count
>> to ~0U fsck, tune2fs and friends fail.
>
> Probably the easiest way to move forward here would be to use debugfs
> to edit the superblock to reduce the blocks count by s_blocks_per_group
> and the inode count by (s_inodes_per_group - 1) so e2fsck doesn't think
> you have that last group at all.  This assumes that you do not have any
> inodes allocated in the last group, which is unlikely.  If you do, you
> could use "ncheck" to find the names of those files and copy them to
> some other part of the filesystem before editing the superblock.

This relates to Ted's information as well.  Working strategy "short term", i.e. the next few days.
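If I go this route, I imagine the edit would look roughly like this (a sketch only: the debugfs ssv commands are echoed rather than run, and I'd verify the field names against the local debugfs before touching the real superblock):

```shell
#!/bin/sh
# Current values from the dumpe2fs output above.
BLOCKS=17179869184      # s_blocks_count
INODES=4294967295       # s_inodes_count (the wrapped value, 2^32-1)
BPG=32768               # s_blocks_per_group
IPG=8192                # s_inodes_per_group

# Drop the last group: reduce blocks by one full group, and inodes by
# (s_inodes_per_group - 1), since the stored count is already one short.
NEW_BLOCKS=$((BLOCKS - BPG))
NEW_INODES=$((INODES - (IPG - 1)))
echo "new blocks_count: $NEW_BLOCKS"    # 17179836416
echo "new inodes_count: $NEW_INODES"    # 4294959104

# The actual debugfs edits (shown only, not executed here):
echo "debugfs -w -R 'ssv blocks_count $NEW_BLOCKS' /dev/mapper/lvm-home"
echo "debugfs -w -R 'ssv inodes_count $NEW_INODES' /dev/mapper/lvm-home"
```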
>> With the attached patch (sorry, Thunderbird breaks my inlining of
>> patches) tune2fs operates (-l at least) as expected, and fsck gets to
>> pass5 where it segfaults with the following stack trace (compiled with -O0):
>>
>> /dev/exp/exp contains a file system with errors, check forced.
>> Pass 1: Checking inodes, blocks, and sizes
>> Pass 2: Checking directory structure
>> Pass 3: Checking directory connectivity
>> Pass 4: Checking reference counts
>> Pass 5: Checking group summary information
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> 0x00005555555ac8d1 in ext2fs_bg_flags_test (fs=0x555555811e90, group=552320, bg_flag=1)
>>     at blknum.c:445
>> 445             return gdp->bg_flags & bg_flag;
>> (gdb) bt
>> #0  0x00005555555ac8d1 in ext2fs_bg_flags_test (fs=0x555555811e90, group=552320, bg_flag=1)
>>     at blknum.c:445
>> #1  0x000055555558c343 in check_inode_bitmaps (ctx=0x5555558112b0) at pass5.c:759
>> #2  0x000055555558a251 in e2fsck_pass5 (ctx=0x5555558112b0) at pass5.c:57
>> #3  0x000055555556fb48 in e2fsck_run (ctx=0x5555558112b0) at e2fsck.c:249
>> #4  0x000055555556e849 in main (argc=5, argv=0x7fffffffdfe8) at unix.c:1859
>> (gdb) print *gdp
>> $1 = {bg_block_bitmap = 528400, bg_inode_bitmap = 0, bg_inode_table = 528456,
>>   bg_free_blocks_count = 0, bg_free_inodes_count = 0, bg_used_dirs_count = 4000, bg_flags = 8,
>>   bg_exclude_bitmap_lo = 0, bg_block_bitmap_csum_lo = 0, bg_inode_bitmap_csum_lo = 8,
>>   bg_itable_unused = 0, bg_checksum = 0, bg_block_bitmap_hi = 528344, bg_inode_bitmap_hi = 0,
>>   bg_inode_table_hi = 528512, bg_free_blocks_count_hi = 0, bg_free_inodes_count_hi = 0,
>>   bg_used_dirs_count_hi = 4280, bg_itable_unused_hi = 8, bg_exclude_bitmap_hi = 0,
>>   bg_block_bitmap_csum_hi = 0, bg_inode_bitmap_csum_hi = 0, bg_reserved = 0}
>>
>> ... so I'm not sure why it even segfaults.  gdb can retrieve a value of
>> 8 for bg_flags ... and yet, if the code does that it segfaults.  So not
>> sure what the discrepancy is there - probably a misunderstanding of
>> what's going wrong, but the only thing I can see that can segfault is
>> the gdp dereference, and since that seems to be a valid pointer ...
>>
>> I am not sure if this is a separate issue, or due to me tampering with
>> the inode counter in the way that I am (I have to assume the latter).
>> For testing I created a thin volume (1TB) in a separate environment,
>> where I created a 16TB filesystem initially, and then expanded that to
>> 64TB, resulting in exactly the same symptoms we saw in the production
>> environment.  I created a thousand empty files in the root folder.  The
>> filesystem is consuming 100GB on-disk currently in the thin volume.
>> Note that group=552320 > 524288 (17179869184 / 32768).
>
> I was looking at the code in check_inode_bitmaps() and there are
> definitely some risky areas in the loop handling.  The bitmap end
> and bitmap start values are __u64, so they should be OK.  The loop
> counter "i" is only a 32-bit value, so it may overflow with 2^32-1
> inodes (or really 2^32 inodes in the table, even if the last one
> is not used).
>
> I think you need to figure out why the group counter has exceeded
> the actual number of groups.  It is likely that the segfault is
> justified by going beyond the end of the array, as there is no
> valid group data or inode table for the groups.
>
> One possibility is if the last group is marked EXT2_BG_INODE_UNINIT
> then it will increment "i" by s_inodes_per_group and the loop
> condition will never be false.  Converting i to a __u64 variable may
> solve the problem in this code.
>
> While I'm not against fixing this, I also suspect that there are
> other parts of the code that may have similar problems, which may
> be dangerous if you are resizing the filesystem.

I confirmed the overflow occurs in the initial skip_group code.  Essentially it increments i by s_inodes_per_group-1, which takes it to exactly 0.  The outer loop then increments that by 1, back to 1, and we keep looping.  The reason for this is that the last group has one fewer inode than the other groups (effectively).
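Emulating the 32-bit counter with modular arithmetic shows the wrap exactly (numbers taken from the dumpe2fs output above; shell arithmetic is 64-bit, so the modulus stands in for the 32-bit "i"):

```shell
#!/bin/sh
IPG=8192
NGROUPS=$((17179869184 / 32768))        # 524288 groups
# i after skipping the first 524287 full groups (i starts at 1):
i=$((1 + (NGROUPS - 1) * IPG))          # 4294959105
# The last group holds IPG-1 inodes; skip_group adds them,
# which lands exactly on 2^32 and wraps to 0 in 32 bits:
i=$(( (i + IPG - 1) % 4294967296 ))
echo "i after skip_group: $i"           # 0
# The outer loop's i++ then takes it back to 1, and pass5 never terminates.
```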

I've modified this slightly:

--- a/e2fsck/pass5.c
+++ b/e2fsck/pass5.c
@@ -644,6 +644,8 @@ redo_counts:
                                group_free = inodes;
                                free_inodes += inodes;
                                i += inodes;
+                               if (i == 0 || i > fs->super->s_inodes_count)
+                                       i = fs->super->s_inodes_count;
                                skip_group = 0;
                                goto do_counts;
                        }

And now I'm getting:

Internal error: fudging end of bitmap (2)

If I understand correctly, this comes from check_inode_end(), specifically the second loop.

end = EXT2_INODES_PER_GROUP(fs->super) * fs->group_desc_count;

This value should match s_inodes_count, right?  With:

@@ -841,7 +843,7 @@ static void check_inode_end(e2fsck_t ctx)
 
        clear_problem_context(&pctx);
 
-       end = EXT2_INODES_PER_GROUP(fs->super) * fs->group_desc_count;
+       end = fs->super->s_inodes_count;
        pctx.errcode = ext2fs_fudge_inode_bitmap_end(fs->inode_map, end,
                                                     &save_inodes_count);
        if (pctx.errcode) {

Then it just fails with "fudging end of bitmap (1)".

This leads me to believe that if s_inodes_count is reduced to 2^32 - s_inodes_per_group instead of 2^32-1 (assuming those inodes are not in use) then things should work - opinions?

How can I verify if any of those inodes are currently used?
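To partially answer my own question, something along these lines might work (a sketch: the inode range is derived from the geometry above, and the debugfs commands are only shown in comments since I'd want to verify the syntax against the local version first):

```shell
#!/bin/sh
# Device path from my setup, used only in the comments below.
DEV=/dev/mapper/lvm-home
IPG=8192
NGROUPS=$((17179869184 / 32768))
# Inode numbers are 1-based; the last group covers this range
# (one inode short of a full group because of the wrapped count).
FIRST=$(( (NGROUPS - 1) * IPG + 1 ))    # 4294959105
LAST=$(( NGROUPS * IPG - 1 ))           # 4294967295
echo "last group spans inodes $FIRST..$LAST on $DEV"

# Not run here: ask debugfs whether an inode is marked in use, e.g.
#   debugfs -R "testi <$FIRST>" "$DEV"
# or check the last group's free-inode count in the dumpe2fs group listing:
#   dumpe2fs "$DEV" | tail -n 20
```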

>> Regarding further expansion, would appreciate some advice, there are two
>> (three) possible options that I could come up with:
>>
>> 1. Find a way to reduce the number of inodes per group (say to 4096,
>> which would require re-allocating all inodes >= 2^31 to inodes < 2^31).
>
> Once you can properly access the filesystem (e.g. after editing the
> superblock to shrink it by one group), then I believe Darrick added
> support to resize2fs (or tune2fs) to reduce the number of inodes in
> the filesystem.  The time needed to run this depends on how many
> inodes are in use in the filesystem.

If he has, I cannot locate it.  I checked the e2fsprogs git logs for all patches by him and could not find anything; a quick check of everything from 2017 and 2018 didn't reveal anything either (the latter check was less comprehensive and I could have missed it).
> I'd strongly recommend to make a backup of the filesystem before such
> an operation, since if there is a bug or interruption it could leave
> you with a broken filesystem.

Unfortunately that's not an option.  If I had that kind of space available I'd back it up, create a new filesystem and copy back.  We only have around 4.5TB currently unallocated on the VG.

>> (3. Find some free space, create a new filesystem, and iteratively move
>> data from the one to the other, shrinking and growing the filesystems as
>> per progress - will never be able to move more data than what is
>> currently available on the system, around 4TB in my case, so will take a
>> VERY long time.)
I am contemplating creating a new 4TB filesystem in that space, mounting it in the "correct" location (I would need to find a gap to umount the old fs first) and symlinking the top-level folders over.  From there I'd need to rsync (cp -a) one top-level folder at a time over, remove the symlink (breaking access), revalidate (rsync) and then rename into the correct location.  Once the "new" filesystem is depleted of space I'd need to offline the old filesystem, shrink it, lvreduce the block device and online it again, then allocate the released storage to the new filesystem and extend that.  I'd then iterate until all data has been moved over.  Shrinking is a long and slow operation in general, and the filesystem is under heavy pressure (reads peak around 600MB/s, average around 150MB/s, with very few "idle" times).
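The per-folder step might look roughly like this (a dry-run sketch: the mount points and folder name are hypothetical, and every command is echoed rather than executed):

```shell
#!/bin/sh
# Hypothetical mount points for illustration only.
OLD=/mnt/home-old
NEW=/home
dir=somefolder   # hypothetical example top-level folder

# 1. Pre-copy while the symlink still points at the old fs.
echo rsync -aHAX "$OLD/$dir/" "$NEW/.$dir.tmp/"
# 2. Break access by removing the symlink, then do a final revalidation pass.
echo rm "$NEW/$dir"
echo rsync -aHAX --delete "$OLD/$dir/" "$NEW/.$dir.tmp/"
# 3. Rename the migrated copy into the correct location.
echo mv "$NEW/.$dir.tmp" "$NEW/$dir"
```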

A blockage here would be if one of the top-level folders (which is the only level at which I can guarantee that there is no hard-linking between folders) is larger than my total free space currently.  I've started a du process for this, and currently the largest is 3.5TB, but it's still calculating (has been going since I sent my first email).

Even with this strategy (which I'm starting to think is the way to go) I first need to be able to get rid of that last group, and the ideas presented by Ted only work if debugfs works (which, given my previous patch, it would, but fsck still won't clear the filesystem - which may be fine).  After reducing the block count, fsck should be fine again and I can set out on the above strategy, which would need to be executed over days, probably weeks.
>> 2. Allow to add additional blocks to the filesystem, without adding
>> additional inodes.
>>
>> I'm currently aiming for option 2 since that looks to be the simplest.
>> Simply allow overflow to happen, but don't allocate additional inodes
>> if the number of inodes is already ~0U.
>
> This would be a useful feature to have, especially since we allow
> FLEX_BG to put all of the metadata at the start of the filesystem.
> This would essentially allow growing the inode bitmap independently
> of the block bitmap, which can definitely be convenient in some cases.
>
> It may touch a lot of places in the code, but at least it would be
> pretty easily tested, since it could be used on small filesystems.
I assume by creating the filesystem with -N and a LARGE number of inodes, and then resizing from there?

Since we normally perform on-line resizes I figured I'd give that a try first.  As I looked at the code there are a few things I noticed:

Online resizing tries three approaches with the kernel:

1.  ioctl EXT4_IOC_RESIZE_FS (unless requested not to, or if it fails, then);
2.  ioctl EXT2_IOC_GROUP_EXTEND - if this succeeds (as verified via ext2fs_read_bitmaps()) we're done;
3.  the flex-group handling gets cleared.  EXT2_IOC_GROUP_EXTEND is used to extend the last group, then a sequence of EXT2_IOC_GROUP_ADD ioctls is used to add more groups.

In my mind we really only need to care about the EXT4_IOC_RESIZE_FS ioctl here?

# ext4_resize_fs - checks for inode overflow correctly now (fixed by Jan).  This check will need to go away, given that we want to be able to add more blocks without adding more inodes.

# It looks like we will still need to allocate flex groups, but not every flex group will have inodes, so really I'm not sure how to approach this.  Pass 5 above depends on the inode count matching the group counts, if I understand it correctly, in order to rebuild bitmaps and to determine whether the filesystem's group summary information is consistent.

# On the other hand, it looks to me like when growing the filesystem the new blocks are simply added to the last group (the call to ext4_group_extend_no_check).  Can we simply stop here?  Or am I misunderstanding, and this only adds blocks to complete the group?  In other words, if there is normally 128MB in a group and the last group was 96MB, this will just use the first 32MB to complete that group first?
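To make sure I'm reading my own 96MB/32MB example consistently (plain arithmetic with the 4096-byte blocks and 32768 blocks/group from above):

```shell
#!/bin/sh
BLOCK=4096
BPG=32768
GROUP_BYTES=$((BLOCK * BPG))     # 134217728 bytes = 128MiB per full group
LAST_GROUP_MB=96                 # the hypothetical partial last group
TOPUP_MB=$(( GROUP_BYTES / 1048576 - LAST_GROUP_MB ))
echo "bytes per group: $GROUP_BYTES"
echo "MiB needed to complete the last group: $TOPUP_MB"   # 32
```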

# We then loop adding additional flex groups (which could be normal groups if s_log_groups_per_flex == 0), outputting progress every 10 seconds or so.

# The ext4_setup_next_flex function is most likely where changes need to be made, or perhaps the functions called by it.
  Specifically - it should not add additional inodes if this would cause an overflow (or should only add up to ~0U, which as per above causes other difficulties, at least at my skill level).  It might be simpler to still allocate them in the group (2MB worth of blocks with 8192 inodes/group) but just not make them available for use.  Still having them allocated at least means that all the normal clusters_per_group, inodes_per_group and other related calculations remain intact, but making that 2MB of data available pretty much means that any groups beyond the point where the inode count would overflow would need those values specially treated.
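The 2MB figure can be cross-checked against the geometry above (the 256-byte inode size is inferred from the 512 inode blocks per group, not confirmed):

```shell
#!/bin/sh
IPG=8192
INODE_SIZE=256                   # inferred: 512 blocks * 4096 bytes / 8192 inodes
BLOCK=4096
ITABLE_BLOCKS=512                # "Inode blocks per group" from dumpe2fs
BYTES=$((ITABLE_BLOCKS * BLOCK)) # 2097152 bytes = 2MiB of inode table per group
echo "inode table bytes per group: $BYTES"
# Cross-check: 8192 inodes * 256 bytes should give the same 2MiB.
[ $((IPG * INODE_SIZE)) -eq "$BYTES" ] && echo "consistent"
```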

I think I've realized this is somewhat above my skill level, and more complicated than I had hoped.  I'd better get cracking on creating a new filesystem and starting to move data over.

Kind Regards,
Jaco

