Yeah, resize2fs shrinking is slow --- really slow.  Part of that is
because the primary use cases were either very small file systems
(e.g., shrinking an install image so it will fit on a 2 GB USB thumb
drive), or cases where the file system was only shrunk by a very tiny
amount (shrinking it just enough so that a full-disk partition could
be converted to be managed under LVM, so that a new disk could be
added and the file system then grown to span the new disk).

The other part of it is that resize2fs is very old code --- it was
written back before 64-bit file systems were a thing, and so it
doesn't use some of the newer bitmap functions that are now much
faster when the bitmap is implemented using a red-black tree instead
of a bit array.  It was also written to be super-conservative, so
blocks get moved one at a time, as opposed to finding a contiguous
extent that needs to be moved, trying to allocate a new contiguous
extent, and then copying the blocks in one fell swoop.

Could it be modified to be much faster --- surely.  I'd suggest adding
some automated tests to make sure that we don't corrupt file systems,
and more importantly, file data, before trying to do surgery on
resize2fs, though.  There might be some simple optimizations that
could speed up some of the stand-alone passes (e.g., identifying
which blocks or inodes need to be moved).  But where you are spending
the most time is almost certainly the block-moving pass, and that's
the one where screw-ups would end up losing data.

The official party line is: "resize2fs shrinking is an optimization
for backup, reformat, and restore, and you should do a backup
beforehand, so that if resize2fs explodes in your face, you can just
do the reformat and restore procedure.  And for massive shrinks, if
resize2fs is slower, just do the backup, reformat, and restore path
from the get-go; the resulting file system will be more efficiently
laid out, so file accesses will be faster."

In your case, you have a file system so large that backup is not
practical --- I get that.  (Although I do hope you *do* have some
kind of backup, and if it's precious data, an off-line backup.)  But
as a result, this hasn't been high priority before.  I think you're
actually the first user who has tried to do a large-scale shrink ---
at least, the first one I know of, anyway.  :-)

> In terms of an on-line shrink (in which case I personally don't care
> if a single shrink takes a week), I've been wondering, also based on
> comments from Ted regarding preventing a mounted filesystem from
> allocating from the high block group which I needed to drop to get
> back to a non-corrupted filesystem.

Yes, we've brainstormed before about adding ioctls that would
constrain allocation to certain regions.  We could combine that with
the defrag code (although that's not really written to be
super-efficient for large data movements, either --- again, because
it predates 64-bit file systems) to at least move the data blocks
below the shrink point.  The trick, as you have pointed out, will be
moving the inodes for the portions of the inode table that need to be
evacuated.  In addition, the defrag code doesn't handle directories
at all, so the directory blocks which are above the shrink point
would require special handling.
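Just to make the allocation-constraint idea a bit more concrete, the
interface might look vaguely like the sketch below.  To be clear,
nothing like this exists today; the name, the struct layout, and the
ioctl number are made up out of whole cloth, and a real design would
need to answer a bunch of questions (does the restriction persist
across remounts, what do we return when the allowed region fills up,
and so on):

/*
 * Hypothetical sketch only: no such ioctl exists in ext4 today.
 * The struct, the name, and the ioctl number are all invented
 * purely for illustration.
 */
#include <sys/ioctl.h>		/* ioctl() */
#include <linux/ioctl.h>	/* _IOW() */
#include <linux/types.h>	/* __u64 */

struct ext4_alloc_range {
	__u64	start;		/* first block the allocator may hand out */
	__u64	len;		/* length of the allowed region, in blocks */
};

/* "Only satisfy new block allocations from [start, start + len)." */
#define EXT4_IOC_SET_ALLOC_RANGE	_IOW('f', 42, struct ext4_alloc_range)

/*
 * An on-line shrink tool would first fence off everything above the
 * new file system size, and then ask the defrag machinery to migrate
 * the data (and eventually inode table and directory) blocks that
 * live up there down below the shrink point.
 */
static int fence_off_high_blocks(int fd, __u64 new_size_in_blocks)
{
	struct ext4_alloc_range range = {
		.start	= 0,
		.len	= new_size_in_blocks,
	};

	return ioctl(fd, EXT4_IOC_SET_ALLOC_RANGE, &range);
}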
> The tricky (read: hard, based on what I know) part will be to free
> inodes since additional links can be added at any point in time.  So
> code may need to be added to the code that adds links to add the
> link to the new inode instead, and a remapping will need to be kept
> in-kernel during the operation.

The other hard part is that this would require the kernel to scan all
of the directories in kernel space, which would add a lot of
complexity to the kernel.  And as far as the in-kernel data
structures go, if there are a large number of inodes that need to be
moved, we might have to do multiple passes.

We would also have to deal with what happens if we crash while the
on-line shrink is in progress.  What we would probably have to do is
update the inode that is to be moved with a "forwarding pointer"
which says "see inode 12345", so that if the kernel reads inode
123456789, it would go to inode 12345 to find the "real" inode.

> This can also result in inode numbers for files changing from a
> userspace perspective which for most applications is unlikely to be
> a problem, but what about tools like find or tar that utilizes these
> numbers to determine if two files are hard-links?  Or du that uses
> this to list actual storage used instead of perceived?  My use-case
> is predominantly rsync, where inode numbers may very well also be
> utilized to determine hard-links (-H option).

The other problem is if the file system is being exported using NFS.
The forwarding pointer idea would help, since old file handles would
reference the old inode number and we could redirect to the new inode
for a while, but that could easily be considered a big mess.

> Another big problem here is that I suspect this will affect general
> performance negatively even when a resize operation is not in
> progress.

I'd be a lot more worried about the dead code that might be required.
We'd probably want to put the shrinking code into a kernel module
which could be unloaded when it is not in use.  And of course,
there's the code and test maintenance needed to make sure the shrink
code doesn't bitrot over time.

Cheers,

					- Ted
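P.S.  On the hard-link question: find, tar, du, and rsync -H don't do
anything magical.  They basically remember the (st_dev, st_ino) pairs
they get back from stat(2) for anything with st_nlink > 1, and treat
two paths as the same file when both values match.  Very roughly
(this is a simplified illustration, not code lifted from any of those
tools):

#include <stdio.h>
#include <sys/stat.h>

/*
 * Two paths name the same underlying file (i.e., they are hard links
 * to each other) iff they are on the same device and have the same
 * inode number.
 */
static int same_file(const char *a, const char *b)
{
	struct stat sa, sb;

	if (stat(a, &sa) < 0 || stat(b, &sb) < 0)
		return -1;
	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s file1 file2\n", argv[0]);
		return 2;
	}
	return same_file(argv[1], argv[2]) == 1 ? 0 : 1;
}

So if an on-line shrink renumbers an inode in the middle of a run,
any bookkeeping keyed on st_ino goes stale, which is exactly the
problem you're pointing at.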