> On Dec 15, 2023, at 9:30 AM, Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>
> On Fri, Dec 15, 2023 at 05:07:36PM +0000, Wengang Wang wrote:
>>
>>
>>> On Dec 14, 2023, at 7:15 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>>
>>> On Thu, Dec 14, 2023 at 01:35:02PM -0800, Darrick J. Wong wrote:
>>>> On Thu, Dec 14, 2023 at 09:05:21AM -0800, Wengang Wang wrote:
>>>>> Background:
>>>>> We have the existing xfs_fsr tool, which defragments files. It has
>>>>> the following characteristics:
>>>>> 1. Defragmentation is implemented by file copying.
>>>>> 2. The copy (to a temporary file) is exclusive. The source file is
>>>>>    locked during the copy and all IO requests are blocked until the
>>>>>    copy is done.
>>>>> 3. The copy can take a long time for huge files, with IO blocked the
>>>>>    whole time.
>>>>> 4. The copy requires as many free blocks as the source file has. If
>>>>>    the source is huge, say 1TiB, it's hard to require the file
>>>>>    system to have another 1TiB free.
>>>>>
>>>>> The use case of concern is XFS files used as image files for Virtual
>>>>> Machines:
>>>>> 1. The image files are huge; they can reach hundreds of GiB and even
>>>>>    TiB.
>>>>> 2. Backups are made via reflink copies, and CoW makes the files
>>>>>    badly fragmented.
>>>>> 3. Fragmentation makes reflink copies super slow.
>>>>> 4. During a reflink copy, all IO requests to the file are blocked
>>>>>    for a very long time. That causes timeouts in the VM, and the
>>>>>    timeouts lead to disaster.
>>>>>
>>>>> This feature aims to:
>>>>> 1. reduce file fragmentation, making future reflinks (much) faster,
>>>>>    and
>>>>> 2. at the same time, defragment in a non-exclusive manner that
>>>>>    doesn't block file IO for long.
>>>>>
>>>>> Non-exclusive defragment
>>>>> Here we introduce a non-exclusive way to defragment a file,
>>>>> especially a huge file, without blocking its IO for long.
>>>>> Non-exclusive defragmentation divides the whole file into small
>>>>> pieces. For each piece, we lock the file, defragment the piece and
>>>>> unlock the file. Defragmenting a small piece doesn't take long, so
>>>>> file IO requests get served between pieces rather than being blocked
>>>>> for a long time. We also insert a (user adjustable) idle time
>>>>> between defragmenting two consecutive pieces to balance
>>>>> defragmentation against file IO. So although the defragmentation can
>>>>> take longer than xfs_fsr, it balances defragmentation and file IO.
>>>>
>>>> I'm kinda surprised you don't just turn on alwayscow mode, use an
>>>> iomap_funshare-like function to read in and dirty pagecache (which
>>>> will hopefully create a new large cow fork mapping) and then flush it
>>>> all back out with writeback. Then you don't need all this state
>>>> tracking, kthreads management, and copying file data through the
>>>> buffer cache. Wouldn't that be a lot simpler?
>>>
>>> Hmmm. I don't think it needs any kernel code to be written at all.
>>> I think we can do atomic section-by-section, crash-safe active file
>>> defrag from userspace like this:
>>>
>>> scratch_fd = open(O_TMPFILE);
>>> defrag_fd = open("file-to-be-defragged");
>>>
>>> while (offset < target_size) {
>>>
>>>         /*
>>>          * share a range of the file to be defragged into
>>>          * the scratch file.
>>>          */
>>>         args.src_fd = defrag_fd;
>>>         args.src_offset = offset;
>>>         args.src_len = length;
>>>         args.dst_offset = offset;
>>>         ioctl(scratch_fd, FICLONERANGE, args);
>>>
>>>         /*
>>>          * Force the shared range to be unshared via a
>>>          * copy-on-write operation in the file to be
>>>          * defragged. This causes the file needing to be
>>>          * defragged to have new extents allocated and the
>>>          * data to be copied over and written out.
>>>          */
>>>         fallocate(defrag_fd, FALLOC_FL_UNSHARE_RANGE, offset, length);
>>>         fdatasync(defrag_fd);
>>>
>>>         /*
>>>          * Punch out the original extents we shared to the
>>>          * scratch file so they are returned to free space.
>>>          */
>>>         fallocate(scratch_fd, FALLOC_FL_PUNCH, offset, length);
>
> You could even set args.dst_offset = 0 and ftruncate here.
>
> But yes, this is a better suggestion than adding more kernel code.
>
>>>         /* move onto next region */
>>>         offset += length;
>>> };
>>>
>>> As long as the length is large enough for the unshare to create a
>>> large contiguous delalloc region for the COW, I think this would
>>> likely achieve the desired "non-exclusive" defrag requirement.
>>>
>>> If we were to implement this as, say, an xfs_spaceman operation, then
>>> all the user-controlled policy bits (like inter-chunk delays, chunk
>>> sizes, etc) just become command line parameters for the defrag
>>> command...
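To make sure I understand the proposal, here is roughly what that loop
looks like as compilable C. This is only an untested sketch, not the
pseudocode above verbatim: the chunk size and inter-piece sleep are
arbitrary policy knobs, the scratch file has to live on the same
(reflink-enabled) XFS filesystem, and the punch step uses the real
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE flags. Error handling is
minimal.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>       /* FALLOC_FL_UNSHARE_RANGE */
#include <linux/fs.h>           /* FICLONERANGE, struct file_clone_range */
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

#define CHUNK   (64ULL << 20)   /* 64MiB pieces; arbitrary policy knob */

int main(int argc, char **argv)
{
        struct file_clone_range args = { 0 };
        struct stat st;
        off_t offset = 0, length;
        int defrag_fd, scratch_fd;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <file-to-defrag> <tmpfile-dir>\n",
                        argv[0]);
                return 1;
        }

        defrag_fd = open(argv[1], O_RDWR);
        /* scratch file must be on the same fs for FICLONERANGE to work */
        scratch_fd = open(argv[2], O_TMPFILE | O_RDWR, 0600);
        if (defrag_fd < 0 || scratch_fd < 0 || fstat(defrag_fd, &st) < 0) {
                perror("setup");
                return 1;
        }

        while (offset < st.st_size) {
                length = st.st_size - offset;
                if (length > (off_t)CHUNK)
                        length = CHUNK;

                /* Share this piece of the target file into the scratch file. */
                args.src_fd = defrag_fd;
                args.src_offset = offset;
                args.src_length = length;      /* kernel field is src_length */
                args.dest_offset = offset;     /* ... and dest_offset */
                if (ioctl(scratch_fd, FICLONERANGE, &args) < 0) {
                        perror("FICLONERANGE");
                        return 1;
                }

                /*
                 * Unshare via COW in the target file: new, hopefully
                 * contiguous, extents are allocated and the data is
                 * copied over and written back.
                 */
                if (fallocate(defrag_fd, FALLOC_FL_UNSHARE_RANGE,
                              offset, length) < 0 ||
                    fdatasync(defrag_fd) < 0) {
                        perror("unshare");
                        return 1;
                }

                /* Return the old extents (now only in scratch) to free space. */
                if (fallocate(scratch_fd,
                              FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                              offset, length) < 0) {
                        perror("punch");
                        return 1;
                }

                offset += length;
                usleep(100 * 1000); /* idle so regular file IO gets served */
        }
        return 0;
}

One alignment note: FICLONERANGE wants block-aligned offsets and
lengths, and only allows an unaligned end at EOF, which is why the last
piece simply runs to st.st_size.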
>>
>> Ha, the idea from user space is very interesting!
>> So far I have the following thoughts:
>> 1). Do FICLONERANGE/FALLOC_FL_UNSHARE_RANGE/FALLOC_FL_PUNCH work on a
>> FS without reflink enabled?
>
> It does not.
>
> That said, for your use case (reflinked vm disk images that fragment
> over time) that won't be an issue. For non-reflink filesystems, there
> are fewer chances for extreme fragmentation due to the lack of COW.
>
>> 2). What if there is a big hole in the file to be defragmented? Will
>> it cause block allocation and writing blocks of zeroes?
>
> FUNSHARE ignores holes.
>
>> 3). In case a big range of the file is good (not much fragmented),
>> the 'defrag' on that range is not necessary.
>
> Yep, so you'd have to check the bmap/fiemap output first to identify
> areas that are more fragmented than you'd like.
>
>> 4). The user space defrag can't use a try-lock mode to give IO
>> requests priority. I am not sure how important this is.
>>
>> Maybe we can work with xfs_bmap to get extents info and skip good
>> extents and holes to help cases 2) and 3).
>
> Yeah, that sounds necessary.
>

Thanks for answering! A FIEMAP scan along those lines is sketched in
the P.S. below.

Wengang
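P.S. For skipping holes and ranges that are already laid out well
(cases 2) and 3) above), here is the kind of FIEMAP scan I have in
mind. Again an untested sketch: CHUNK and the extent-count threshold
are arbitrary, and the per-piece count saturates at MAX_EXTENTS, which
is still enough to call a piece "fragmented". xfs_bmap -v reports the
same information from the command line.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fiemap.h>       /* struct fiemap, FIEMAP_FLAG_SYNC */
#include <linux/fs.h>           /* FS_IOC_FIEMAP */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>

#define CHUNK           (64ULL << 20)   /* match the defrag piece size */
#define MAX_EXTENTS     128             /* counting stops here */
#define THRESHOLD       8               /* arbitrary "too fragmented" limit */

/*
 * Count the extents backing [offset, offset + length), or -1 on error.
 * Holes are simply not reported, so a return of 0 means "all hole".
 */
static int count_extents(int fd, __u64 offset, __u64 length)
{
        struct fiemap *fm;
        int nr;

        fm = calloc(1, sizeof(*fm) +
                       MAX_EXTENTS * sizeof(struct fiemap_extent));
        if (!fm)
                return -1;
        fm->fm_start = offset;
        fm->fm_length = length;
        fm->fm_flags = FIEMAP_FLAG_SYNC;        /* flush dirty data first */
        fm->fm_extent_count = MAX_EXTENTS;
        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
                nr = -1;
        else
                nr = fm->fm_mapped_extents;
        free(fm);
        return nr;
}

int main(int argc, char **argv)
{
        struct stat st;
        __u64 offset;
        int fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || fstat(fd, &st) < 0) {
                perror(argv[1]);
                return 1;
        }
        for (offset = 0; offset < (__u64)st.st_size; offset += CHUNK) {
                int nr = count_extents(fd, offset, CHUNK);

                printf("piece @ %llu: %d extents%s\n",
                       (unsigned long long)offset, nr,
                       nr > THRESHOLD ? " <- worth defragging" : "");
        }
        return 0;
}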