> On Dec 15, 2023, at 9:30 AM, Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>
> On Fri, Dec 15, 2023 at 05:07:36PM +0000, Wengang Wang wrote:
>>
>>
>>> On Dec 14, 2023, at 7:15 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>>
>>> On Thu, Dec 14, 2023 at 01:35:02PM -0800, Darrick J. Wong wrote:
>>>> On Thu, Dec 14, 2023 at 09:05:21AM -0800, Wengang Wang wrote:
>>>>> Background:
>>>>> We have the existing xfs_fsr tool, which defragments files. It has
>>>>> the following characteristics:
>>>>> 1. Defragmentation is implemented by file copying.
>>>>> 2. The copy (to a temporary file) is exclusive. The source file is
>>>>>    locked during the copy and all IO requests are blocked until the
>>>>>    copy is done.
>>>>> 3. The copy can take a long time for huge files, with IO blocked the
>>>>>    whole time.
>>>>> 4. The copy requires as many free blocks as the source file has. If
>>>>>    the source is huge, say 1TiB, it's hard to require the file
>>>>>    system to have another 1TiB free.
>>>>>
>>>>> The use case of concern is XFS files used as image files for Virtual
>>>>> Machines:
>>>>> 1. The image files are huge; they can reach hundreds of GiB and even
>>>>>    TiB.
>>>>> 2. Backups are made via reflink copies, and CoW makes the files
>>>>>    badly fragmented.
>>>>> 3. Fragmentation makes reflink copies super slow.
>>>>> 4. During a reflink copy, all IO requests to the file are blocked
>>>>>    for a very long time. That causes timeouts in the VM, and the
>>>>>    timeouts lead to disaster.
>>>>>
>>>>> This feature aims to:
>>>>> 1. reduce file fragmentation, making future reflinks (much) faster,
>>>>>    and
>>>>> 2. at the same time, defragment in a non-exclusive manner that
>>>>>    doesn't block file IO for long.
>>>>>
>>>>> Non-exclusive defragment
>>>>> Here we introduce a non-exclusive way to defragment a file,
>>>>> especially a huge file, without blocking its IO for long.
>>>>> Non-exclusive defragmentation divides the whole file into small
>>>>> pieces. For each piece, we lock the file, defragment the piece and
>>>>> unlock the file. Defragmenting a small piece doesn't take long, so
>>>>> file IO requests get served between pieces rather than being blocked
>>>>> for a long time. We also insert a (user adjustable) idle time
>>>>> between defragmenting two consecutive pieces to balance
>>>>> defragmentation against file IO. So although the defragmentation can
>>>>> take longer than xfs_fsr, it balances defragmentation and file IO.
>>>>
>>>> I'm kinda surprised you don't just turn on alwayscow mode, use an
>>>> iomap_funshare-like function to read in and dirty pagecache (which
>>>> will hopefully create a new large cow fork mapping) and then flush it
>>>> all back out with writeback. Then you don't need all this state
>>>> tracking, kthreads management, and copying file data through the
>>>> buffer cache. Wouldn't that be a lot simpler?
>>>
>>> Hmmm. I don't think it needs any kernel code to be written at all.
>>> I think we can do atomic section-by-section, crash-safe active file
>>> defrag from userspace like this:
>>>
>>> scratch_fd = open(O_TMPFILE);
>>> defrag_fd = open("file-to-be-defragged");
>>>
>>> while (offset < target_size) {
>>>
>>>         /*
>>>          * share a range of the file to be defragged into
>>>          * the scratch file.
>>>          */
>>>         args.src_fd = defrag_fd;
>>>         args.src_offset = offset;
>>>         args.src_len = length;
>>>         args.dst_offset = offset;
>>>         ioctl(scratch_fd, FICLONERANGE, args);
>>>
>>>         /*
>>>          * Force the shared range to be unshared via a
>>>          * copy-on-write operation in the file to be
>>>          * defragged. This causes the file needing to be
>>>          * defragged to have new extents allocated and the
>>>          * data to be copied over and written out.
>>>          */
>>>         fallocate(defrag_fd, FALLOC_FL_UNSHARE_RANGE, offset, length);
>>>         fdatasync(defrag_fd);
>>>
>>>         /*
>>>          * Punch out the original extents we shared to the
>>>          * scratch file so they are returned to free space.
>>>          */
>>>         fallocate(scratch_fd, FALLOC_FL_PUNCH, offset, length);
>
> You could even set args.dst_offset = 0 and ftruncate here.
>
> But yes, this is a better suggestion than adding more kernel code.
>
>>>         /* move onto next region */
>>>         offset += length;
>>> };
>>>
>>> As long as the length is large enough for the unshare to create a
>>> large contiguous delalloc region for the COW, I think this would
>>> likely achieve the desired "non-exclusive" defrag requirement.
>>>
>>> If we were to implement this as, say, an xfs_spaceman operation, then
>>> all the user-controlled policy bits (like inter-chunk delays, chunk
>>> sizes, etc) just become command line parameters for the defrag
>>> command...
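To make sure I understand the proposal, here is roughly what that loop
looks like as compilable C. This is only an untested sketch, not the
pseudocode above verbatim: the chunk size and inter-piece sleep are
arbitrary policy knobs, the scratch file has to live on the same
(reflink-enabled) XFS filesystem, and the punch step uses the real
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE flags. Error handling is
minimal.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>       /* FALLOC_FL_UNSHARE_RANGE */
#include <linux/fs.h>           /* FICLONERANGE, struct file_clone_range */
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

#define CHUNK   (64ULL << 20)   /* 64MiB pieces; arbitrary policy knob */

int main(int argc, char **argv)
{
        struct file_clone_range args = { 0 };
        struct stat st;
        off_t offset = 0, length;
        int defrag_fd, scratch_fd;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <file-to-defrag> <tmpfile-dir>\n",
                        argv[0]);
                return 1;
        }

        defrag_fd = open(argv[1], O_RDWR);
        /* scratch file must be on the same fs for FICLONERANGE to work */
        scratch_fd = open(argv[2], O_TMPFILE | O_RDWR, 0600);
        if (defrag_fd < 0 || scratch_fd < 0 || fstat(defrag_fd, &st) < 0) {
                perror("setup");
                return 1;
        }

        while (offset < st.st_size) {
                length = st.st_size - offset;
                if (length > (off_t)CHUNK)
                        length = CHUNK;

                /* Share this piece of the target file into the scratch file. */
                args.src_fd = defrag_fd;
                args.src_offset = offset;
                args.src_length = length;      /* kernel field is src_length */
                args.dest_offset = offset;     /* ... and dest_offset */
                if (ioctl(scratch_fd, FICLONERANGE, &args) < 0) {
                        perror("FICLONERANGE");
                        return 1;
                }

                /*
                 * Unshare via COW in the target file: new, hopefully
                 * contiguous, extents are allocated and the data is
                 * copied over and written back.
                 */
                if (fallocate(defrag_fd, FALLOC_FL_UNSHARE_RANGE,
                              offset, length) < 0 ||
                    fdatasync(defrag_fd) < 0) {
                        perror("unshare");
                        return 1;
                }

                /* Return the old extents (now only in scratch) to free space. */
                if (fallocate(scratch_fd,
                              FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                              offset, length) < 0) {
                        perror("punch");
                        return 1;
                }

                offset += length;
                usleep(100 * 1000); /* idle so regular file IO gets served */
        }
        return 0;
}

One alignment note: FICLONERANGE wants block-aligned offsets and
lengths, and only allows an unaligned end at EOF, which is why the last
piece simply runs to st.st_size.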
>>
>> Ha, the idea from user space is very interesting!
>> So far I have the following thoughts:
>> 1). Do FICLONERANGE/FALLOC_FL_UNSHARE_RANGE/FALLOC_FL_PUNCH work on a
>> FS without reflink enabled?
>
> It does not.
>
> That said, for your use case (reflinked vm disk images that fragment
> over time) that won't be an issue. For non-reflink filesystems, there
> are fewer chances for extreme fragmentation due to the lack of COW.
>
>> 2). What if there is a big hole in the file to be defragmented? Will
>> it cause block allocation and writing blocks of zeroes?
>
> FUNSHARE ignores holes.
>
>> 3). In case a big range of the file is good (not much fragmented),
>> the 'defrag' on that range is not necessary.
>
> Yep, so you'd have to check the bmap/fiemap output first to identify
> areas that are more fragmented than you'd like.
>
>> 4). The user space defrag can't use a try-lock mode to give IO
>> requests priority. I am not sure how important this is.
>>
>> Maybe we can work with xfs_bmap to get extents info and skip good
>> extents and holes to help cases 2) and 3).
>
> Yeah, that sounds necessary.
>

Thanks for answering! A FIEMAP scan along those lines is sketched in
the P.S. below.

Wengang
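P.S. For skipping holes and ranges that are already laid out well
(cases 2) and 3) above), here is the kind of FIEMAP scan I have in
mind. Again an untested sketch: CHUNK and the extent-count threshold
are arbitrary, and the per-piece count saturates at MAX_EXTENTS, which
is still enough to call a piece "fragmented". xfs_bmap -v reports the
same information from the command line.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fiemap.h>       /* struct fiemap, FIEMAP_FLAG_SYNC */
#include <linux/fs.h>           /* FS_IOC_FIEMAP */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>

#define CHUNK           (64ULL << 20)   /* match the defrag piece size */
#define MAX_EXTENTS     128             /* counting stops here */
#define THRESHOLD       8               /* arbitrary "too fragmented" limit */

/*
 * Count the extents backing [offset, offset + length), or -1 on error.
 * Holes are simply not reported, so a return of 0 means "all hole".
 */
static int count_extents(int fd, __u64 offset, __u64 length)
{
        struct fiemap *fm;
        int nr;

        fm = calloc(1, sizeof(*fm) +
                       MAX_EXTENTS * sizeof(struct fiemap_extent));
        if (!fm)
                return -1;
        fm->fm_start = offset;
        fm->fm_length = length;
        fm->fm_flags = FIEMAP_FLAG_SYNC;        /* flush dirty data first */
        fm->fm_extent_count = MAX_EXTENTS;
        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
                nr = -1;
        else
                nr = fm->fm_mapped_extents;
        free(fm);
        return nr;
}

int main(int argc, char **argv)
{
        struct stat st;
        __u64 offset;
        int fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || fstat(fd, &st) < 0) {
                perror(argv[1]);
                return 1;
        }
        for (offset = 0; offset < (__u64)st.st_size; offset += CHUNK) {
                int nr = count_extents(fd, offset, CHUNK);

                printf("piece @ %llu: %d extents%s\n",
                       (unsigned long long)offset, nr,
                       nr > THRESHOLD ? " <- worth defragging" : "");
        }
        return 0;
}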