Hi Ted,

Thank you for a very comprehensive response.

Most of this data is my backups, which are now lagging by more than a week (making me very nervous). I never operate a production system without backups in place; thankfully the only times I've ever needed them to date were due to human error rather than system failures.

Seeing that I've got a few more iterations of this ahead of me (at least another 7), do you think there are some easy wins to be had? Even a 5 or 10% optimization here would be greatly beneficial.

To summarize based on what you mentioned, I *think* the steps would be as follows (for an online resize, driven from userspace):

1. Implement a mechanism (ioctl) to avoid allocation above a certain block + inode (a rough sketch of the interface follows after this list). Based on what I've seen I doubt this would be overly difficult. It could be a temporary, in-memory restriction, or be persisted to the superblock. I'm inclined to opt for the former as this won't require disk-layout changes, but please advise (disk-layout changes will be required for step 2 anyway).

2. Implement a "forwarding address" mechanism for inodes (ioctl). This would allow re-allocation of inodes. From what I know this will involve at least the following:

2.0 Documentation changes describing the on-disk format change, including updates to the userspace tools to accommodate the new on-disk format.

2.1 On open of an inode, if the inode is a forwarding inode, open the forwarded-to inode instead (second sketch below).

2.2 readdir() - I'm not sure whether this does a kind of stat on referenced inodes by default (readdir(2) implies not, readdir(3) implies some filesystems do, including ext4, via the d_type field). Update references to forwarding inodes to reference the forwarded-to inode instead (if mounted rw).

2.3 An extra ioctl to clone an inode to a newly allocated inode and replace the original with a forwarding pointer.

3. Utilize (a variant of) the defrag code to re-allocate extents (this will require a full-filesystem scan of all extent trees to find which extents need to be re-allocated) and inodes. This will likely need to be a single logical change, with a separate patch to move the code from e4defrag that can be shared into a library. Care must be taken if a mount point is masking files inside the directory it is mounted on, or we could possibly just warn that step 3.6 may fail:

3.1. Use (1) to mark that we want to start a filesystem reduce.

3.2. Use the ioctl from 2.3 to re-allocate (forward) all inodes that will no longer be available.

3.4. Open the root (/) inode and initiate a readdir() scan of the whole filesystem, serving two purposes:

3.4.1. It will trigger the code in 2.1 and 2.2 to update forwarding references.

3.4.2. It will allow us to find files utilizing blocks that need to be re-allocated (get_file_extents as per the current e4defrag code), and to re-allocate those (see the FIEMAP sketch below).

3.5. Scan the extents for bad blocks and free any blocks that will no longer be under consideration.

3.6. Specifically scan any other inodes that may have blocks/inodes in the upper range.

4. Finalize the resize by reducing the number of blocks and inodes (another ioctl).

I'm hoping my ordering is correct such that this can be delivered as a 6 or 7 patch series. This new "feature" would require a feature bit from the non-backwards-compatible (incompat) set, i.e. kernels that don't support the feature and encounter a filesystem with the "forwarding active" bit set should refuse to open it; the bit can be cleared again by fsck if no forwarding inodes are found.
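To make step 1 slightly more concrete, below is the sort of userspace-visible interface I have in mind. To be clear, EXT4_IOC_SET_ALLOC_LIMIT, struct ext4_alloc_limit and its field names are invented here purely for illustration; nothing like this exists in ext4 today, and the ioctl number is arbitrary:

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/ioctl.h>

  /* Hypothetical interface for step 1 -- none of these names exist yet. */
  struct ext4_alloc_limit {
          uint64_t max_block;   /* no block allocations at or above this block */
          uint32_t max_inode;   /* no inode allocations at or above this inode */
          uint32_t flags;       /* e.g. "persist to superblock" vs in-memory only */
  };

  #define EXT4_IOC_SET_ALLOC_LIMIT  _IOW('f', 42, struct ext4_alloc_limit)

  /* Userspace side: fd is an open descriptor on the filesystem root. */
  static int set_alloc_limit(int fd, uint64_t max_block, uint32_t max_inode)
  {
          struct ext4_alloc_limit lim = {
                  .max_block = max_block,
                  .max_inode = max_inode,
                  .flags     = 0,          /* in-memory only for now */
          };

          return ioctl(fd, EXT4_IOC_SET_ALLOC_LIMIT, &lim);
  }

On the kernel side this would presumably boil down to an extra bounds check in the block and inode allocators whenever such a limit is set.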
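For step 2.1, the lookup path could follow the forwarding pointer roughly like this. This is illustrative only: EXT4_FORWARD_FL and i_forward_ino are invented names, the real change would live in fs/ext4, and ext4_iget()'s exact signature differs between kernel versions:

  /* Sketch only: open an inode and, if it is a forwarding stub, return
   * the forwarded-to inode instead. */
  struct inode *ext4_iget_forwarded(struct super_block *sb, unsigned long ino)
  {
          struct inode *inode = ext4_iget(sb, ino);

          if (IS_ERR(inode))
                  return inode;

          if (ext4_test_inode_flag(inode, EXT4_FORWARD_FL)) {
                  unsigned long target = EXT4_I(inode)->i_forward_ino;

                  iput(inode);                    /* drop the stub */
                  inode = ext4_iget(sb, target);  /* return the real inode */
          }

          return inode;
  }

The readdir() update in 2.2 would then rewrite the directory entry to point at the target inode the first time it is traversed on a read-write mount.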
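For the scan in 3.4.2, the extent data is available from userspace via the FIEMAP ioctl (which is what get_file_extents() in e4defrag is built on). Below is a rough, self-contained sketch, not e4defrag code, that walks a tree and flags regular files with at least one extent at or above a given physical byte offset. A real tool would loop until FIEMAP_EXTENT_LAST, cover directories and special inodes as per 3.6, and reuse the shared library routines instead:

  #define _XOPEN_SOURCE 700
  #include <fcntl.h>
  #include <ftw.h>
  #include <linux/fiemap.h>
  #include <linux/fs.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <sys/stat.h>
  #include <unistd.h>

  static uint64_t boundary;     /* shrink point, as a physical offset in bytes */

  static int check_file(const char *path, const struct stat *sb,
                        int typeflag, struct FTW *ftwbuf)
  {
          size_t sz = sizeof(struct fiemap) + 32 * sizeof(struct fiemap_extent);
          struct fiemap *fm;
          int fd;

          (void)sb; (void)ftwbuf;
          if (typeflag != FTW_F)          /* regular files only in this sketch */
                  return 0;

          fd = open(path, O_RDONLY | O_NOFOLLOW);
          if (fd < 0)
                  return 0;

          fm = calloc(1, sz);
          if (!fm) {
                  close(fd);
                  return 0;
          }
          fm->fm_start = 0;
          fm->fm_length = FIEMAP_MAX_OFFSET;
          fm->fm_flags = FIEMAP_FLAG_SYNC;
          fm->fm_extent_count = 32;

          /* One batch of 32 extents is enough to show the idea. */
          if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0) {
                  for (unsigned i = 0; i < fm->fm_mapped_extents; i++) {
                          if (fm->fm_extents[i].fe_physical >= boundary) {
                                  printf("%s: extent above shrink point\n", path);
                                  break;
                          }
                  }
          }
          free(fm);
          close(fd);
          return 0;
  }

  int main(int argc, char **argv)
  {
          if (argc != 3) {
                  fprintf(stderr, "usage: %s <boundary-bytes> <dir>\n", argv[0]);
                  return 1;
          }
          boundary = strtoull(argv[1], NULL, 0);
          /* FTW_MOUNT keeps the walk on one filesystem, which also sidesteps
           * the mount-point masking concern for the walk itself. */
          return nftw(argv[2], check_file, 64, FTW_PHYS | FTW_MOUNT);
  }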
With this in place, if a crash happens there is nothing to do; the filesystem will simply resume at full size, unless the "max alloc" fields are made permanent by writing them into the superblock. Each step has both a kernel and a userspace component (separate patches, possibly with the userspace code as a single patch, number 7?). For step 2 care will need to be taken to deal with double-forwards (i.e. we restrict allocation and then at a later stage restrict it further, or a resize to an even smaller size is requested after a crash).

More than happy to attempt the patches. It looks like it should be possible to use e4defrag as a template for most of the required work, and library functions for most of the big pieces seem to be in place (or some functions can be moved from e4defrag into the library).

I do have a few questions though:

1. How do I go about writing (a) sensible test case(s) for this?

2. Based on the above process I expect it may actually be harder to optimize the userspace resize than to go online - can anyone concur with my assessment? It is based in part on a very rudimentary gleaning of the e4defrag code.

3. What to do with reference counts on inodes? I would suggest: the forwarded-to inode should have a link count equal to all direct and indirect references, while the forwarding inode should count only the references still going through it (so that we know when it can be freed).

4. How to deal with the free block count? There are three approaches I can see:

4.1 Ostrich. Don't care; have df report the inodes + blocks that cannot be allocated due to the restriction as free anyway.

4.2 In-memory-only distinction; reporting via df and friends would be fine (see the sketch further down).

4.3 On-disk persistence (i.e. adjust the values in the superblock; this would require changes to fsck too to compensate).

I expect 4.2 would be the simplest, and if the max alloc restriction is in-memory only, the most sane approach. If the max alloc is persisted this could still be fine, but then we might rather want to consider 4.3.

5. I only count three extra ioctls that need to be added kernel-side. The rest of the kernel changes affect existing code paths, I believe. Is a separate module still worth it, and if so, how would I approach that?

6. May I simply allocate the next available bit out of the feature set for this, or is there some central database this needs to go into (i.e. step 2.0)?

The advantage of going "online" (for me at least) is that even if this process takes longer than an offline resize I don't need to be offline (i.e. I can probably do 1 or 2 TB at a time rather than whole volumes). Even if this takes me a month or so longer, in the larger scheme of things that is acceptable (my current ETA for full completion is AT LEAST another 9 weeks of being completely OFFLINE, possibly with bursts of being online). I suspect that this approach may end up being an optimization in terms of speed anyway.

And yes, I know this is a very infrequent use-case. I'm not so sure it doesn't happen, though; the general recommendation I find online is as per the official party line, and I suspect most people just do exactly that. I would too if it were practical for me. Most organizations with such large filesystems have SAN systems available where handing me a new LUN for this purpose is not only practical, it's generally considered trivial and best practice with the lowest risk too. I tend to agree with that sentiment.
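Regarding 4.2, what I have in mind is roughly the following adjustment at the point where ext4 fills in the statfs numbers. Again purely a sketch: s_shrink_limit and s_free_above_limit are invented fields (the limit from step 1 plus a counter of free blocks above it that the allocator would maintain):

  /* Sketch of option 4.2 (in-memory-only accounting); this would slot
   * into ext4_statfs() before it returns. */
  static void shrink_adjust_statfs(struct ext4_sb_info *sbi, struct kstatfs *buf)
  {
          u64 unusable = READ_ONCE(sbi->s_free_above_limit);

          if (!READ_ONCE(sbi->s_shrink_limit))
                  return;                 /* no shrink in progress */

          /* Free blocks above the limit can no longer be handed out,
           * so hide them from df and friends. */
          buf->f_bfree  -= min(buf->f_bfree,  unusable);
          buf->f_bavail -= min(buf->f_bavail, unusable);
  }

Option 4.3 would do the same arithmetic once, persistently, against the superblock counters, which is why it drags fsck into the picture.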
Kind Regards,
Jaco

On 08/08/2018 16:13, Theodore Y. Ts'o wrote:
> Yeah, resize2fs shrinking is slow --- really slow.
> Part of that is because the primary use cases were either for very
> small file systems (e.g., shrinking an install image so it will fit
> on a 2 GB USB thumb drive) or where the image was only shrunk by a
> very tiny amount (shrinking the file system enough so that a
> full-disk partition could be converted to be managed under LVM so a
> new disk could be added and the file system grown to span the new
> disk).
>
> The other part of it is because resize2fs is very old code --- it was
> written back before 64-bit file systems were a thing, and so it
> doesn't use some of the new bitmap functions that are now much faster
> when the bitmap is implemented using a red-black tree and not a bit
> array.
>
> It was also written to be super-conservative so blocks get moved one
> at a time, as opposed to finding a contiguous extent that needs to be
> moved, trying to allocate a new contiguous extent, and then copying
> the blocks in one fell swoop.
>
> Could it be modified to be much faster --- surely. I'd suggest adding
> some automated tests to make sure that we don't corrupt file systems,
> and more importantly, file data, before trying to do surgery on
> resize2fs, though. There might be some simple optimizations that
> might speed up some of the stand-alone passes (e.g., identifying
> which blocks or inodes need to be moved). But where you are spending
> the most time is almost certainly the block moving pass, and that's
> one where screw ups would end up losing data.
>
> The official party line is "resize2fs shrinking is an optimization
> for backup, reformat, restore, and you should do a backup beforehand,
> so that if resize2fs explodes in your face, you can just do the
> reformat and restore procedure. And for massive shrinks, if resize2fs
> is slower, just do the backup, reformat, and restore path from the
> get go; the resulting file system will be more efficiently laid out,
> so file accesses will be faster."
>
> In your case, you have a file system so large that backup is not
> practical --- I get that. (Although I do hope you *do* have some kind
> of backup, and if it's precious data, off-line backup.) But as a
> result, it's why this hasn't been high priority before. I think
> you're actually the first user who has tried to do large-scale shrink
> before --- at least, the first one I know of, anyway. :-)
>
>> In terms of an on-line shrink (in which case I personally don't care
>> if a single shrink takes a week), I've been wondering, also based on
>> comments from Ted regarding preventing a mounted filesystem from
>> allocating from the high block group which I needed to drop to get
>> back to a non-corrupted filesystem.
>
> Yes, we've brain-stormed before about adding ioctl's that would
> constrain allocation to certain regions. We could combine that with
> the defrag code (although that's not really written to be
> super-efficient for large data movements, either --- again, because
> it predates 64-bit file systems) to at least move the data blocks
> below the shrink point.
>
> The trick, as you have pointed out, will be moving the inodes for the
> portions of the inode table that need to be evacuated. In addition,
> the defrag code doesn't handle directories at all, so the directory
> blocks which are above the shrink point would require special
> handling.
>
>> The tricky (read: hard, based on what I know) part will be to free
>> inodes since additional links can be added at any point in time.
>> So code may need to be added to the code that adds links to add the
>> link to the new inode instead, and a remapping will need to be kept
>> in-kernel during the operation.
>
> The other hard part is this would require the kernel to scan all
> directories in the kernel space, which would be adding a lot of
> complexity into the kernel. And the in-kernel data structures, if
> there are a large number of inodes that need to be moved, we might
> have to do multiple passes.
>
> We also would have to deal with what happens if we crash while the
> on-line shrink was in progress. What we would probably have to do is
> to update the inode that is to be moved with a "forwarding pointer"
> which says, see inode 12345, and so if the kernel reads inode
> 123456789, it would get inode 12345 to find the "real" inode.
>
>> This can also result in inode numbers for files changing from a
>> userspace perspective which for most applications is unlikely to be
>> a problem, but what about tools like find or tar that utilizes these
>> numbers to determine if two files are hard-links? Or du that uses
>> this to list actual storage used instead of perceived? My use-case
>> is predominantly rsync, where inode numbers may very well also be
>> utilized to determine hard-links (-H option).
>
> The other problem is if the file system is being exported using NFS.
> The forwarding pointer idea would help since old file handles would
> reference the old inode number, and we could redirect to the new
> inode for a while, but that could easily be considered a big mess.
>
>> Another big problem here is that I suspect this will affect general
>> performance negatively even when a resize operation is not in
>> progress.
>
> I'd be a lot more worried about the dead code that might be required.
> We'd probably want to put the shrinking code into a kernel module
> which could be unloaded if it is not in use.
>
> And of course, the code and test maintenance to make sure the shrink
> code doesn't bitrot over time.
>
> Cheers,
>
> - Ted