Hi All,

So I've used debugfs to manually shrink my 64TB (corrupt) filesystem by one block group in order to be able to use resize2fs. As I understand it, the process basically works as follows (either in a single pass, or even on a per-block-group basis, and in theory for part of a block group):

* Identify blocks that are in use that will no longer be available.
* Identify inodes that are in use that will no longer be available.
* Perform an inode scan over the whole filesystem, doing the following:
  - re-allocate any extents (blocks) to new locations (affects only a single inode).
  - find links to inodes that won't be available any more.
* For each no-longer-available inode, re-allocate a new inode and update all references to it.
* Update the superblock to indicate the updated filesystem size.

My issue is that I'm busy shrinking 64TB-128MB down to 56TB, and it's been in excess of 72 hours now.

Using debugfs (git master + previously posted custom patch), a check for in-use blocks (testb block count) takes almost 11 minutes, but most of this time is spent opening the filesystem and the actual check takes a few seconds. I can't imagine that testi is much more complicated than this, and checking a few hundred inodes should also take seconds: there is a bitmap indicating use, and while testi takes a filespec and can only test a single inode, a variant that takes inode numbers and uses the bitmaps should be possible, so this too should be a matter of seconds.
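Roughly what I have in mind, as an untested sketch using libext2fs directly (the argument handling and the "first dropped block/inode" cut-off semantics are made up for illustration; this is not an existing debugfs command):

/* batch_test.c - sketch: count in-use blocks/inodes above a cut-off using
 * the on-disk bitmaps, instead of testing them one at a time via testb/testi.
 * Build (assuming the e2fsprogs development headers are installed):
 *   gcc batch_test.c -o batch_test -lext2fs -lcom_err
 */
#include <stdio.h>
#include <stdlib.h>
#include <ext2fs/ext2fs.h>
#include <et/com_err.h>

int main(int argc, char **argv)
{
	ext2_filsys fs;
	errcode_t err;
	blk64_t blk, first_dropped_block;    /* first block past the new size */
	ext2_ino_t ino, first_dropped_inode; /* first inode past the new size */
	unsigned long long busy_blocks = 0, busy_inodes = 0;

	if (argc != 4) {
		fprintf(stderr, "usage: %s device first_dropped_block first_dropped_inode\n",
			argv[0]);
		return 1;
	}
	first_dropped_block = strtoull(argv[2], NULL, 0);
	first_dropped_inode = strtoul(argv[3], NULL, 0);

	/* EXT2_FLAG_64BITS is needed for filesystems with the 64bit feature. */
	err = ext2fs_open(argv[1], EXT2_FLAG_64BITS, 0, 0, unix_io_manager, &fs);
	if (err) {
		com_err(argv[0], err, "while opening %s", argv[1]);
		return 1;
	}

	/* This is where the ~11 minutes go; everything after it is in-memory. */
	err = ext2fs_read_bitmaps(fs);
	if (err) {
		com_err(argv[0], err, "while reading bitmaps");
		ext2fs_close(fs);
		return 1;
	}

	/* Walk the block bitmap from the cut-off to the end of the filesystem. */
	for (blk = first_dropped_block; blk < ext2fs_blocks_count(fs->super); blk++)
		if (ext2fs_test_block_bitmap2(fs->block_map, blk))
			busy_blocks++;

	/* Same for the inode bitmap. */
	for (ino = first_dropped_inode; ino <= fs->super->s_inodes_count; ino++)
		if (ext2fs_test_inode_bitmap2(fs->inode_map, ino))
			busy_inodes++;

	printf("in-use blocks past cut-off: %llu\n", busy_blocks);
	printf("in-use inodes past cut-off: %llu\n", busy_inodes);

	ext2fs_close(fs);
	return 0;
}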
A full walk of the inode tree takes approximately 10-12 hours, and that is for each of icheck and ncheck. In this case we don't care about the names of in-use blocks, so both these scans can be combined, and since based on previous checks it's mostly "small reads" that are time-consuming, I guess we can assume that a combined scan will take <20 hours.

Given that worst case 8TB of data needs to be copied (statistically 7TB), and I've seen reads max out at 700MB/s+ on this system, with writes frequently reaching 450MB/s, I'm going to guess that a migration rate of 200MB/s is not completely unreasonable. Which means that 8TB worth of block migrations comes to ~42000 seconds, or just under 12 hours. So a full shrink should be approximately 32 hours in total, or about two days at 100MB/s. I'm now over 3 days. Disk writes seem to be going at a few KB/s, and CPU usage isn't high either, so I can only deduce minuscule reads + writes currently. Unfortunately I did not pass -p to the resize2fs command.

In terms of an on-line shrink (in which case I personally don't care if a single shrink takes a week), I've been wondering about the following, also based on comments from Ted regarding preventing a mounted filesystem from allocating from the high block group which I needed to drop to get back to a non-corrupted filesystem. Not sure if it's worth the effort, but still wondering about it. And seeing that a single shrink for me is now sitting at >72 hours and I'll need at least 7 more such iterations, possibly closer to 10 ... it might be worth it.

- Add code to mark the maximum block and inode numbers available for allocation. In other words, stop allocating from space that will no longer be available.
- Re-purpose the defrag code that can migrate blocks online to migrate currently in-use blocks/extents. Might just as well attempt to defrag a little whilst doing this anyway.

The tricky (read: hard, based on what I know) part will be to free the inodes, since additional links can be added at any point in time. So code may need to be added so that anything adding a link adds it to the new inode instead, and a remapping will need to be kept in-kernel for the duration of the operation. This can also result in inode numbers for files changing from a userspace perspective, which for most applications is unlikely to be a problem, but what about tools like find or tar that utilize these numbers to determine whether two files are hard-links? Or du, which uses this to report actual storage used rather than apparent size? My use-case is predominantly rsync, where inode numbers may very well also be utilized to determine hard-links (-H option).

Another big problem here is that I suspect this will affect general performance negatively even when a resize operation is not in progress.

Would love opinions. And yes, I am well aware that shrinking filesystems is not an operation that is performed frequently. In my case it's used as part of a migration to lower the number of inodes per block group.

Kind Regards,
Jaco