This patch set introduces a defrag command to xfs_spaceman. Its
functionality and features are described below (this text is also
intended for the man page, so please review):

defrag [-f free_space] [-i idle_time] [-s segment_size] [-n] [-a]

defrag defragments the specified XFS file online and non-exclusively.
The target XFS does not have to be (and must not be) unmounted. While
defragmentation is in progress, file IOs are served 'in parallel'. The
reflink feature must be enabled on the filesystem.

Defragmentation and file IOs

The target file is virtually divided into many small segments. Segments
are the smallest units of defragmentation. Each segment is defragmented
one by one in a lock->defragment->unlock->idle manner. File IOs are
blocked while the target file is locked and are served during the
defragmentation idle time (when the file is unlocked). Though the file
IOs can't really go in parallel, they are not blocked for long. The
locking time basically depends on the segment size. Smaller segments
usually take less locking time, so IOs are blocked for a shorter time;
bigger segments usually need more locking time, so IOs are blocked
longer. See the -s and -i options to balance defragmentation against IO
service.

Temporary file

A temporary file is used for the defragmentation. It is created in the
same directory as the target file and is named ".xfsdefrag_<pid>". It is
a sparse file and holds one defragmentation segment at a time. The
temporary file is removed automatically when defragmentation finishes or
is cancelled with ctrl-c. It is left behind if the kernel crashes while
defragmentation is in progress. In that case, the temporary file has to
be removed manually.

Free blocks consumption

Defragmentation works by (trying to) allocate new (contiguous) blocks,
copying the data and then freeing the old (non-contiguous) blocks.
Usually the number of old blocks freed equals the number of newly
allocated blocks. As a net result, defragmentation doesn't consume free
blocks.
That is true only if the target file is not sharing blocks with other
files. If the target file contains shared blocks, those shared blocks
are not freed back to the filesystem because they are still owned by
other files. In that case defragmentation allocates more blocks than it
frees. On an existing XFS, free blocks might already be over-committed
by earlier reflink snapshots. To avoid driving the XFS into a low free
blocks state, defragmentation excludes (partially) shared segments when
the filesystem free blocks reach a threshold. See the -f option.

Safety and consistency

The file under defragmentation is guaranteed to remain safe and
data-consistent across ctrl-c and a kernel crash.

First extent share

For each segment defragmentation, the current kernel runs a routine that
detects whether the file is sharing blocks. This takes a long time when
the target file contains a huge number of extents and the shared ones,
if any, are at the end. The First extent share feature works around this
issue by making the first several blocks shared. Seeing that the first
blocks are shared, the kernel routine ends quickly. The side effect is
that the "share" flag remains set on the target file. This feature is
enabled by default and can be disabled with the -n option.

extsize and cowextsize

According to the kernel implementation, extsize and cowextsize can have
the following impacts on defragmentation:

1) A non-zero extsize causes a separate block allocation for each extent
   in the segment, and those blocks are not contiguous. The segment
   keeps the same number of extents after defragmentation (no effect).
2) When extsize and/or cowextsize are too big, a lot of pre-allocated
   blocks remain in memory for a while. When new IO lands on those
   pre-allocated blocks, Copy on Write happens and fragments the file.

Readahead

Readahead tries to fetch the data blocks of the next segment in the
background during idle time, with less locking. This feature is disabled
by default; use -a to enable it.

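To make the segment cycle and the -f rule described above concrete, here
is a minimal Python sketch. It is not the code in spaceman/defrag.c: the
names Extent, split_segments, should_defrag and defrag_file are
hypothetical, the way extents are grouped into segments is an
assumption, and the free-space figure is passed in as a fixed number
rather than queried from the filesystem.

```python
import threading
import time
from dataclasses import dataclass

MIB = 1024 * 1024


@dataclass
class Extent:
    length: int    # extent length in bytes
    shared: bool   # True if the blocks are also referenced by another file


def split_segments(extents, segment_size=16 * MIB):
    """Group consecutive extents into segments of at most segment_size bytes.

    Assumption: an extent is never split, so an extent larger than
    segment_size becomes a segment of its own.
    """
    segments, current, size = [], [], 0
    for ext in extents:
        if current and size + ext.length > segment_size:
            segments.append(current)
            current, size = [], 0
        current.append(ext)
        size += ext.length
    if current:
        segments.append(current)
    return segments


def should_defrag(segment, free_mib, threshold_mib=1024):
    """Apply the -f rule: skip (partially) shared segments when free space is low."""
    if len(segment) < 2:
        return False  # assumption: a single-extent segment is already contiguous
    if free_mib < threshold_mib and any(e.shared for e in segment):
        return False
    return True


def defrag_file(extents, free_mib, idle_ms=250, segment_size=16 * MIB,
                threshold_mib=1024):
    """Walk the segments in a lock -> defragment -> unlock -> idle cycle.

    Returns the number of segments that would have been defragmented.
    """
    file_lock = threading.Lock()  # stand-in for the real file lock
    done = 0
    for segment in split_segments(extents, segment_size):
        if not should_defrag(segment, free_mib, threshold_mib):
            continue
        with file_lock:           # file IO would be blocked only while this is held
            done += 1             # placeholder for the real per-segment work
        time.sleep(idle_ms / 1000.0)  # idle: pending file IO gets served
    return done
```

With a 1024 MiB threshold and only 512 MiB free, a segment containing
any shared extent is skipped in this sketch, while segments with no
shared extents are still processed.
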
The command takes the following options:

-f free_space
    The threshold of XFS free blocks in MiB. When free blocks are less
    than this number, (partially) shared segments are excluded from
    defragmentation. Default number is 1024.

-i idle_time
    The time in milliseconds. Defragmentation enters the idle state for
    this long after defragmenting a segment and before handling the
    next. Default number is TOBEDONE.

-s segment_size
    The size limit of segments in bytes. Minimum number is 4MiB, default
    number is 16MiB.

-n  Disable the First extent share feature, which is enabled by default.

-a  Enable the readahead feature, which is disabled by default.

We tested with a real customer metadump using several different
idle_time values and found that 250ms is a good practical sleep time.
Here are some numbers from the test:

Test: running defrag on an image file that is used as the back end of a
block device in a virtual machine. At the same time, fio is running
inside the virtual machine on that block device.

block device type:   NVME
File size:           200GiB
parameters to defrag: free_space: 1024 idle_time: 250
                     First_extent_share: enabled readahead: disabled
Defrag run time:     223 minutes
Number of extents:   6745489(before) -> 203571(after)
Fio read latency:    15.72ms(without defrag) -> 14.53ms(during defrag)
Fio write latency:   32.21ms(without defrag) -> 20.03ms(during defrag)

Wengang Wang (9):
  xfsprogs: introduce defrag command to spaceman
  spaceman/defrag: pick up segments from target file
  spaceman/defrag: defrag segments
  spaceman/defrag: ctrl-c handler
  spaceman/defrag: exclude shared segments on low free space
  spaceman/defrag: workaround kernel xfs_reflink_try_clear_inode_flag()
  spaceman/defrag: sleeps between segments
  spaceman/defrag: readahead for better performance
  spaceman/defrag: warn on extsize

 spaceman/Makefile |   2 +-
 spaceman/defrag.c | 788 ++++++++++++++++++++++++++++++++++++++++++++++
 spaceman/init.c   |   1 +
 spaceman/space.h  |   1 +
 4 files changed, 791 insertions(+), 1 deletion(-)
 create mode 100644 spaceman/defrag.c

-- 
2.39.3 (Apple Git-146)