On Thu, Dec 14, 2023 at 01:35:02PM -0800, Darrick J. Wong wrote:
> On Thu, Dec 14, 2023 at 09:05:21AM -0800, Wengang Wang wrote:
> > Background:
> > We have the existing xfs_fsr tool which do defragment for files. It has the
> > following features:
> > 1. Defragment is implemented by file copying.
> > 2. The copy (to a temporary file) is exclusive. The source file is locked
> >    during the copy (to a temporary file) and all IO requests are blocked
> >    before the copy is done.
> > 3. The copy could take long time for huge files with IO blocked.
> > 4. The copy requires as many free blocks as the source file has.
> >    If the source is huge, say it’s 1TiB, it’s hard to require the file
> >    system to have another 1TiB free.
> >
> > The use case in concern is that the XFS files are used as images files for
> > Virtual Machines.
> > 1. The image files are huge, they can reach hundreds of GiB and even to TiB.
> > 2. Backups are made via reflink copies, and CoW makes the files badly fragmented.
> > 3. fragmentation make reflink copies super slow.
> > 4. during the reflink copy, all IO requests to the file are blocked for super
> >    long time. That makes timeout in VM and the timeout lead to disaster.
> >
> > This feature aims to:
> > 1. reduce the file fragmentation making future reflink (much) faster and
> > 2. at the same time, defragmentation works in non-exclusive manner, it doesn’t
> >    block file IOs long.
> >
> > Non-exclusive defragment
> > Here we are introducing the non-exclusive manner to defragment a file,
> > especially for huge files, without blocking IO to it long. Non-exclusive
> > defragmentation divides the whole file into small pieces. For each piece,
> > we lock the file, defragment the piece and unlock the file. Defragmenting
> > the small piece doesn’t take long. File IO requests can get served between
> > pieces before blocked long.
> > Also we put (user adjustable) idle time between
> > defragmenting two consecutive pieces to balance the defragmentation and file IOs.
> > So though the defragmentation could take longer than xfs_fsr, it balances
> > defragmentation and file IOs.
>
> I'm kinda surprised you don't just turn on alwayscow mode, use an
> iomap_funshare-like function to read in and dirty pagecache (which will
> hopefully create a new large cow fork mapping) and then flush it all
> back out with writeback.  Then you don't need all this state tracking,
> kthreads management, and copying file data through the buffer cache.
> Wouldn't that be a lot simpler?

Hmmm. I don't think it needs any kernel code to be written at all.
I think we can do atomic section-by-section, crash-safe active file
defrag from userspace like this:

	/* the scratch file must live on the same filesystem */
	scratch_fd = open(dir, O_TMPFILE | O_RDWR, 0600);
	defrag_fd = open("file-to-be-defragged", O_RDWR);

	while (offset < target_size) {

		/*
		 * Share a range of the file to be defragged into
		 * the scratch file.
		 */
		args.src_fd = defrag_fd;
		args.src_offset = offset;
		args.src_length = length;
		args.dest_offset = offset;
		ioctl(scratch_fd, FICLONERANGE, &args);

		/*
		 * Force the shared range to be unshared via a
		 * copy-on-write operation in the file to be
		 * defragged. This causes the file needing to be
		 * defragged to have new extents allocated and the
		 * data to be copied over and written out.
		 */
		fallocate(defrag_fd, FALLOC_FL_UNSHARE_RANGE, offset, length);
		fdatasync(defrag_fd);

		/*
		 * Punch out the original extents we shared to the
		 * scratch file so they are returned to free space.
		 */
		fallocate(scratch_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			  offset, length);

		/* move on to the next region */
		offset += length;
	}

As long as the length is large enough for the unshare to create a
large contiguous delalloc region for the COW, I think this would
likely achieve the desired "non-exclusive" defrag requirement.
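
For illustration, the loop above could be fleshed out roughly as
follows. This is only a sketch under assumptions: chunk_len(),
defrag_file() and the idle_ms parameter are names made up here, the
chunk size is assumed to be a multiple of the filesystem block size
(FICLONERANGE wants block-aligned ranges, except where the range ends
at EOF), and the file size is sampled once, so a file that grows during
the run is not fully covered.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>	/* FALLOC_FL_* flags */
#include <linux/fs.h>		/* FICLONERANGE, struct file_clone_range */
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Length of the next chunk, clamped so it never runs past end of file. */
off_t chunk_len(off_t offset, off_t size, off_t chunk)
{
	off_t left = size - offset;

	assert(chunk > 0);
	if (left <= 0)
		return 0;
	return left < chunk ? left : chunk;
}

/*
 * Hypothetical helper: defragment defrag_fd one chunk at a time.
 * scratch_fd must refer to a scratch file (e.g. O_TMPFILE) on the same
 * filesystem. idle_ms is an optional pause between chunks.
 * Returns 0 on success, -1 with errno set on the first failure.
 */
int defrag_file(int defrag_fd, int scratch_fd, off_t chunk, unsigned int idle_ms)
{
	struct stat st;
	off_t offset = 0, len;

	if (chunk <= 0) {
		errno = EINVAL;
		return -1;
	}
	if (fstat(defrag_fd, &st) < 0)
		return -1;

	while ((len = chunk_len(offset, st.st_size, chunk)) > 0) {
		struct file_clone_range args = {
			.src_fd = defrag_fd,
			.src_offset = offset,
			.src_length = len,
			.dest_offset = offset,
		};

		/* Share this range of the target into the scratch file. */
		if (ioctl(scratch_fd, FICLONERANGE, &args) < 0)
			return -1;

		/* Unshare it in the target: allocates new extents via COW. */
		if (fallocate(defrag_fd, FALLOC_FL_UNSHARE_RANGE, offset, len) < 0)
			return -1;
		if (fdatasync(defrag_fd) < 0)
			return -1;

		/* Drop the old extents now held only by the scratch file. */
		if (fallocate(scratch_fd,
			      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			      offset, len) < 0)
			return -1;

		offset += len;

		if (idle_ms) {
			struct timespec ts = {
				.tv_sec = idle_ms / 1000,
				.tv_nsec = (long)(idle_ms % 1000) * 1000000L,
			};
			nanosleep(&ts, NULL);
		}
	}
	return 0;
}
```

A caller would open the scratch file with O_TMPFILE in a directory on
the same filesystem as the target, then call something like
defrag_file(defrag_fd, scratch_fd, 16 << 20, 10). The 16MiB chunk and
10ms idle time are arbitrary example values, not recommendations.
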

If we were to implement this as, say, an xfs_spaceman operation then
all the user controlled policy bits (like inter chunk delays, chunk
sizes, etc) just become command line parameters for the defrag
command...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx