[ Please keep documentation text to 80 columns. ]

[ Please run documentation through a spell checker - there are too many
  typos in this document to point them all out... ]

On Tue, Jul 09, 2024 at 12:10:19PM -0700, Wengang Wang wrote:
> This patch set introduces defrag to the xfs_spaceman command. It has
> the functionality and features below (also subject to being added to
> the man page, so please review):

What's the use case for this?

> defrag [-f free_space] [-i idle_time] [-s segment_size] [-n] [-a]
>
> defrag defragments the specified XFS file online, non-exclusively.

What's "non-exclusively" mean? How is this different to what xfs_fsr
does?

> The target XFS doesn't have to (and must not) be unmounted. When
> defragmentation is in progress, file IOs are served 'in parallel'.
> The reflink feature must be enabled in the XFS filesystem.

xfs_fsr allows IO to occur in parallel to defrag.

> Defragmentation and file IOs
>
> The target file is virtually divided into many small segments.
> Segments are the smallest units for defragmentation. Each segment is
> defragmented one by one in a lock->defragment->unlock->idle manner.

Userspace can't easily lock the file to prevent concurrent access, so
I'm not sure what you are referring to here.

> File IOs are blocked when the target file is locked and are served
> during the defragmentation idle time (file is unlocked).

What file IOs are being served in parallel? The defragmentation IO?
Something else?

> Though the file IOs can't really go in parallel, they are not blocked
> for long. The locking time basically depends on the segment size.
> Smaller segments usually take less locking time, so IOs are blocked
> for a shorter time; bigger segments usually need more locking time,
> so IOs are blocked longer. Check the -s and -i options to balance
> defragmentation and IO service.

How is a user supposed to know what the correct values are for their
storage, files, and workload?

Algorithms should auto tune, not require users and administrators to
use trial and error to find the best numbers to feed a given operation.

> Temporary file
>
> A temporary file is used for the defragmentation. The temporary file
> is created in the same directory as the target file and is named
> ".xfsdefrag_<pid>". It is a sparse file and contains one
> defragmentation segment at a time. The temporary file is removed
> automatically when defragmentation is done or is cancelled by ctrl-c.
> It remains if the kernel crashes while defragmentation is in
> progress. In that case, the temporary file has to be removed
> manually.

O_TMPFILE, as Darrick has already requested.

> Free blocks consumption
>
> Defragmentation works by allocating new (contiguous) blocks, copying
> the data, and then freeing the old (non-contiguous) blocks. Usually
> the number of old blocks to free equals the number of newly allocated
> blocks. As a final result, defragmentation doesn't consume free
> blocks. Well, that is true if the target file is not sharing blocks
> with other files.

This is really hard to read. Defragmentation will -always- consume free
space while it is in progress. It will always release the temporary
space it consumes when it completes.

> In case the target file contains shared blocks, those shared blocks
> won't be freed back to the filesystem as they are still owned by
> other files. So defragmentation allocates more blocks than it frees.

So this is doing an unshare operation as well as defrag? That seems ...
suboptimal. The whole point of sharing blocks is to minimise disk usage
for duplicated data.
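To make the O_TMPFILE suggestion above concrete, here's a minimal
sketch of what it could look like (my code, not the patchset's; error
handling and the fallback for kernels/filesystems without O_TMPFILE
support are omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <libgen.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sketch only: create an unnamed temporary file in the same
 * directory as the target, so no ".xfsdefrag_<pid>" name is ever
 * visible in the namespace.
 */
static int open_defrag_tmpfile(const char *target_path)
{
	char *dir = strdup(target_path);	/* dirname() modifies it */
	int fd;

	if (!dir)
		return -1;
	fd = open(dirname(dir), O_TMPFILE | O_RDWR, 0600);
	free(dir);
	return fd;
}

Because the inode is never linked into the namespace, there's nothing
to race against and nothing for an admin to remove by hand after a
crash - the orphaned inode should be reclaimed by log recovery.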
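On the shared blocks point: userspace can at least observe which
extents in a range are shared via FIEMAP before deciding what to do
with them. A sketch (again mine) that only inspects the first batch of
extents - a real version would loop over the whole range:

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

/*
 * Returns 1 if any of the first 32 mapped extents in
 * [start, start + len) carries FIEMAP_EXTENT_SHARED, 0 if none,
 * -1 on error.
 */
static int segment_has_shared_extents(int fd, __u64 start, __u64 len)
{
	unsigned int count = 32, i;
	struct fiemap *fm;
	int ret = 0;

	fm = calloc(1, sizeof(*fm) + count * sizeof(struct fiemap_extent));
	if (!fm)
		return -1;

	fm->fm_start = start;
	fm->fm_length = len;
	fm->fm_extent_count = count;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
		ret = -1;
	else
		for (i = 0; i < fm->fm_mapped_extents; i++)
			if (fm->fm_extents[i].fe_flags & FIEMAP_EXTENT_SHARED)
				ret = 1;

	free(fm);
	return ret;
}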
> For existing XFS filesystems, free blocks might be over-committed
> when reflink snapshots were created. To avoid running the filesystem
> into a low free blocks state, this defragmentation excludes
> (partially) shared segments when the filesystem free blocks reach a
> threshold. Check the -f option.

Again, how is the user supposed to know when they need to do this? If
the answer is "they should always avoid defrag on low free space", then
why is this an option?

> Safety and consistency
>
> The defragmentation file is guaranteed safe and data consistent
> across ctrl-c and kernel crash.

Which file is the "defragmentation file"? The source or the temp file?

> First extent share
>
> The current kernel has a routine, run for each segment
> defragmentation, that detects whether the file is sharing blocks.

I have no idea what this means, or what interface this refers to.

> It takes a long time when the target file contains a huge number of
> extents and the shared ones, if there are any, are at the end. The
> First extent share feature works around the above issue by making the
> first several blocks shared. Seeing that the first blocks are shared,
> the kernel routine ends quickly. The side effect is that the "share"
> flag would remain on the target file. This feature is enabled by
> default and can be disabled by the -n option.

And from this description, I have no idea what this is doing, what
problem it is trying to work around, or why we'd want to share blocks
out of a file to speed up detection of whether there are shared blocks
in the file.

This description doesn't make any sense to me because I don't know what
interface you are actually having performance issues with. Please
reference the kernel code that is problematic, and explain why the
existing kernel code is problematic and cannot be fixed.

> extsize and cowextsize
>
> According to the kernel implementation, extsize and cowextsize could
> have the following impacts on defragmentation: 1) non-zero extsize
> causes separate block allocations for each extent in the segment and
> those blocks are not contiguous.

Extent size hints do no such thing. They simply provide extent
alignment guidelines and do not affect things like contiguous or
multi-block allocation lengths.

> The segment retains the same number of extents after defragmentation
> (no effect). 2) When extsize and/or cowextsize are too big, a lot of
> pre-allocated blocks remain in memory for a while. When new IO comes
> to those pre-allocated blocks, Copy on Write happens and causes the
> file to become fragmented.

extsize based unwritten extents won't cause COW or cause fragmentation
because they aren't shared and they are contiguous.

I suspect that your definition of "fragmented" isn't taking into
account that unwritten-written-unwritten over a contiguous range is
*not* fragmentation. It's just a contiguous extent in different states,
and this should really not be touched/changed by defragmentation.

Check out xfs_fsr: it ensures that the pattern of unwritten/written
blocks in the defragmented file is identical to the source. i.e. it
preserves preallocation because the application/fs config wants it to
be there....

> Readahead
>
> Readahead tries to fetch the data blocks for the next segment, with
> less locking, in the background during idle time. This feature is
> disabled by default; use -a to enable it.

What are you reading ahead into? Kernel page cache or user buffers?
Either way, it's hardly what I'd call "idle time" if the defrag process
is using it to issue lots of read IO...
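If the readahead is going into the kernel page cache, it shouldn't
need any of the segment locking at all - an advisory hint is enough. A
sketch, assuming page cache readahead is what's intended:

#include <fcntl.h>

/*
 * Hint the kernel to start reading the next segment into the page
 * cache. This is advisory and returns once the readahead has been
 * initiated, so it needs no coordination with the segment locking
 * described above.
 */
static int prefetch_segment(int fd, off_t offset, off_t length)
{
	return posix_fadvise(fd, offset, length, POSIX_FADV_WILLNEED);
}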
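Coming back to the "First extent share" description above: the only
userspace interface I know of for deliberately sharing a file range is
FICLONERANGE. If that is what the patch is doing, it would be
something like this guess of mine:

#include <sys/ioctl.h>
#include <linux/fs.h>

/*
 * Guesswork, not the patchset's code: clone the first 'len' bytes
 * of src_fd into tmp_fd so that the source file becomes a file
 * that "contains shared blocks".
 */
static int share_first_blocks(int src_fd, int tmp_fd, __u64 len)
{
	struct file_clone_range fcr = {
		.src_fd		= src_fd,
		.src_offset	= 0,
		.src_length	= len,
		.dest_offset	= 0,
	};

	return ioctl(tmp_fd, FICLONERANGE, &fcr);
}

If so, that permanently reflink-flags the target file just to
short-circuit some in-kernel scan, which is exactly the kind of side
effect that needs to be spelled out with a reference to the kernel
code involved.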
> The command takes the following options:
>
> -f free_space
>      The threshold of XFS free blocks in MiB. When free blocks are
>      less than this number, (partially) shared segments are excluded
>      from defragmentation. Default number is 1024.

When you are down to 4MB of free space in the filesystem, you shouldn't
even be trying to run defrag because all the free space that will be
left in the filesystem is single blocks.

I would have expected this sort of number to be a percentage of
capacity, defaulting to something like 5% (which is where we start
running low space algorithms in the kernel).

> -i idle_time
>      The time in milliseconds that defragmentation stays idle after
>      defragmenting a segment and before handling the next. Default
>      number is TOBEDONE.

Yeah, I don't think this is something anyone should be expected to use
or tune. If an idle time is needed, the defrag application should be
selecting it itself.

> -s segment_size
>      The size limit in bytes of segments. Minimum number is 4MiB,
>      default number is 16MiB.

Why were these numbers chosen? What happens if the file has ~32MB sized
extents and the user wants the file to be returned to a single large
contiguous extent if possible?

i.e. how is the user supposed to know how to set this for any given
file without first having examined the exact pattern of fragmentation
in the file?

> -n   Disable the First extent share feature. Enabled by default.

So confusing. Is the "feature disable flag" enabled by default, or is
the feature enabled by default?

> -a   Enable the readahead feature, disabled by default.

Same confusion, but opposite logic.

I would highly recommend that you get a native english speaker to
review, spell and grammar check the documentation before the next time
you post it.

> We tested with a real customer metadump with several different
> 'idle_time's and found 250ms to be a good sleep time in practice.
> Here are some numbers from the test:
>
> Test: running defrag on an image file used as the back end of a
> block device in a virtual machine, while fio runs inside the virtual
> machine on that block device.
>
> block device type: NVMe
> File size: 200GiB
> parameters to defrag: free_space: 1024, idle_time: 250,
>      First_extent_share: enabled, readahead: disabled
> Defrag run time: 223 minutes
> Number of extents: 6745489 (before) -> 203571 (after)

So an average extent size of ~32kB before, ~1MB after? How much of
these are shared extents?

Runtime is 13380 secs, so if we copied 200GiB in that time, the defrag
ran at ~16MB/s. That's not very fast. What's the CPU utilisation of the
defrag task and kernel side processing?

What is the difference between "first_extent_share" enabled and
disabled (both performance numbers and CPU usage)?

> Fio read latency: 15.72ms (without defrag) -> 14.53ms (during defrag)
> Fio write latency: 32.21ms (without defrag) -> 20.03ms (during
> defrag)

So the IO latency is *lower* when defrag is running? That doesn't make
any sense, unless the fio throughput is massively reduced while defrag
is running.

What's the throughput change in the fio workload? What's the change in
worst case latency for the fio workload? i.e. post the actual fio
results so we can see the whole picture of the behaviour, not just a
single cherry-picked number.

Really, though, I have to ask: why is this an xfs_spaceman command and
not something built into the existing online defrag program we have
(xfs_fsr)?
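For context on that last question, the core of xfs_fsr's mechanism is
small: copy the data into a temporary file, then atomically swap the
extent maps of the two inodes. Stripped right down (my sketch against
the xfsprogs headers; the real code also preserves unwritten extents
and validates much more):

#include <sys/types.h>
#include <sys/ioctl.h>
#include <string.h>
#include <xfs/xfs.h>	/* XFS_IOC_SWAPEXT, struct xfs_swapext */

/*
 * After tmp_fd has been populated with a defragmented copy of the
 * data, swap the extent maps of the two files atomically. sx_stat
 * must come from a bulkstat of the target taken before the copy
 * started, so the kernel can reject the swap if the target changed
 * underneath us.
 */
static int swap_into_place(int target_fd, int tmp_fd, off_t size,
			   struct xfs_bstat *target_stat)
{
	struct xfs_swapext sx;

	memset(&sx, 0, sizeof(sx));
	sx.sx_version	= XFS_SX_VERSION;
	sx.sx_fdtarget	= target_fd;
	sx.sx_fdtmp	= tmp_fd;
	sx.sx_offset	= 0;
	sx.sx_length	= size;
	sx.sx_stat	= *target_stat;

	return ioctl(target_fd, XFS_IOC_SWAPEXT, &sx);
}

Because the kernel refuses the swap if the target was modified after
the bulkstat, fsr is safe against concurrent IO without any userspace
"locking" of the file.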
I'm sure I'll have more questions as I go through the code - I'll start
at the userspace IO engine part of the patchset so I have some idea of
what the defrag algorithm actually is...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx