Yeah, resize2fs shrinking is slow --- really slow.  Part of that is
because the primary use cases were either very small file systems
(e.g., shrinking an install image so it will fit on a 2 GB USB thumb
drive), or cases where the file system was only shrunk by a very tiny
amount (shrinking it just enough so that a full-disk partition could
be converted to be managed under LVM, so that a new disk could be
added and the file system then grown to span the new disk).

The other part of it is that resize2fs is very old code --- it was
written back before 64-bit file systems were a thing, and so it
doesn't use some of the newer bitmap functions that are now much
faster when the bitmap is implemented using a red-black tree instead
of a bit array.  It was also written to be super-conservative, so
blocks get moved one at a time, as opposed to finding a contiguous
extent that needs to be moved, trying to allocate a new contiguous
extent, and then copying the blocks in one fell swoop.

Could it be modified to be much faster --- surely.  I'd suggest adding
some automated tests to make sure that we don't corrupt file systems,
and more importantly, file data, before trying to do surgery on
resize2fs, though.  There might be some simple optimizations that
could speed up some of the stand-alone passes (e.g., identifying
which blocks or inodes need to be moved).  But where you are spending
the most time is almost certainly the block-moving pass, and that's
the one where screw-ups would end up losing data.

The official party line is: "resize2fs shrinking is an optimization
for backup, reformat, and restore, and you should do a backup
beforehand, so that if resize2fs explodes in your face, you can just
do the reformat and restore procedure.  And for massive shrinks, if
resize2fs is slower, just do the backup, reformat, and restore path
from the get-go; the resulting file system will be more efficiently
laid out, so file accesses will be faster."

In your case, you have a file system so large that backup is not
practical --- I get that.  (Although I do hope you *do* have some
kind of backup, and if it's precious data, an off-line backup.)  But
as a result, this hasn't been high priority before.  I think you're
actually the first user who has tried to do a large-scale shrink ---
at least, the first one I know of, anyway.  :-)

> In terms of an on-line shrink (in which case I personally don't care
> if a single shrink takes a week), I've been wondering, also based on
> comments from Ted regarding preventing a mounted filesystem from
> allocating from the high block group which I needed to drop to get
> back to a non-corrupted filesystem.

Yes, we've brainstormed before about adding ioctls that would
constrain allocation to certain regions.  We could combine that with
the defrag code (although that's not really written to be
super-efficient for large data movements, either --- again, because
it predates 64-bit file systems) to at least move the data blocks
below the shrink point.  The trick, as you have pointed out, will be
moving the inodes for the portions of the inode table that need to be
evacuated.  In addition, the defrag code doesn't handle directories
at all, so the directory blocks which are above the shrink point
would require special handling.
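Just to make the allocation-constraint idea a bit more concrete, the
interface might look vaguely like the sketch below.  To be clear,
nothing like this exists today; the name, the struct layout, and the
ioctl number are made up out of whole cloth, and a real design would
need to answer a bunch of questions (does the restriction persist
across remounts, what do we return when the allowed region fills up,
and so on):

/*
 * Hypothetical sketch only: no such ioctl exists in ext4 today.
 * The struct, the name, and the ioctl number are all invented
 * purely for illustration.
 */
#include <sys/ioctl.h>		/* ioctl() */
#include <linux/ioctl.h>	/* _IOW() */
#include <linux/types.h>	/* __u64 */

struct ext4_alloc_range {
	__u64	start;		/* first block the allocator may hand out */
	__u64	len;		/* length of the allowed region, in blocks */
};

/* "Only satisfy new block allocations from [start, start + len)." */
#define EXT4_IOC_SET_ALLOC_RANGE	_IOW('f', 42, struct ext4_alloc_range)

/*
 * An on-line shrink tool would first fence off everything above the
 * new file system size, and then ask the defrag machinery to migrate
 * the data (and eventually inode table and directory) blocks that
 * live up there down below the shrink point.
 */
static int fence_off_high_blocks(int fd, __u64 new_size_in_blocks)
{
	struct ext4_alloc_range range = {
		.start	= 0,
		.len	= new_size_in_blocks,
	};

	return ioctl(fd, EXT4_IOC_SET_ALLOC_RANGE, &range);
}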
> The tricky (read: hard, based on what I know) part will be to free
> inodes since additional links can be added at any point in time.  So
> code may need to be added to the code that adds links to add the
> link to the new inode instead, and a remapping will need to be kept
> in-kernel during the operation.

The other hard part is that this would require the kernel to scan all
of the directories in kernel space, which would add a lot of
complexity to the kernel.  And as far as the in-kernel data
structures go, if there are a large number of inodes that need to be
moved, we might have to do multiple passes.

We would also have to deal with what happens if we crash while the
on-line shrink is in progress.  What we would probably have to do is
update the inode that is to be moved with a "forwarding pointer"
which says "see inode 12345", so that if the kernel reads inode
123456789, it would go to inode 12345 to find the "real" inode.

> This can also result in inode numbers for files changing from a
> userspace perspective which for most applications is unlikely to be
> a problem, but what about tools like find or tar that utilizes these
> numbers to determine if two files are hard-links?  Or du that uses
> this to list actual storage used instead of perceived?  My use-case
> is predominantly rsync, where inode numbers may very well also be
> utilized to determine hard-links (-H option).

The other problem is if the file system is being exported using NFS.
The forwarding pointer idea would help, since old file handles would
reference the old inode number and we could redirect to the new inode
for a while, but that could easily be considered a big mess.

> Another big problem here is that I suspect this will affect general
> performance negatively even when a resize operation is not in
> progress.

I'd be a lot more worried about the dead code that might be required.
We'd probably want to put the shrinking code into a kernel module
which could be unloaded when it is not in use.  And of course,
there's the code and test maintenance needed to make sure the shrink
code doesn't bitrot over time.

Cheers,

					- Ted
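P.S.  On the hard-link question: find, tar, du, and rsync -H don't do
anything magical.  They basically remember the (st_dev, st_ino) pairs
they get back from stat(2) for anything with st_nlink > 1, and treat
two paths as the same file when both values match.  Very roughly
(this is a simplified illustration, not code lifted from any of those
tools):

#include <stdio.h>
#include <sys/stat.h>

/*
 * Two paths name the same underlying file (i.e., they are hard links
 * to each other) iff they are on the same device and have the same
 * inode number.
 */
static int same_file(const char *a, const char *b)
{
	struct stat sa, sb;

	if (stat(a, &sa) < 0 || stat(b, &sb) < 0)
		return -1;
	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s file1 file2\n", argv[0]);
		return 2;
	}
	return same_file(argv[1], argv[2]) == 1 ? 0 : 1;
}

So if an on-line shrink renumbers an inode in the middle of a run,
any bookkeeping keyed on st_ino goes stale, which is exactly the
problem you're pointing at.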