> On Jul 16, 2024, at 9:11 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> 
> On Tue, Jul 16, 2024 at 08:23:35PM +0000, Wengang Wang wrote:
>>> Ok, so this is a linear iteration of all extents in the file that
>>> filters extents for the specific "segment" that is going to be
>>> processed. I still have no idea why fixed length segments are
>>> important, but "linear extent scan for filtering" seems somewhat
>>> expensive.
>>
>> Hm… not actually fixed length segments, but the segment size can't
>> exceed the limit. So segment.ds_length <= LIMIT.
> 
> Which is effectively fixed length segments....
> 
>> A larger segment takes a longer time (with the file locked) to
>> defrag. The segment size limit is a way to balance the defrag
>> against the parallel IO latency.
> 
> Yes, I know why you've done it. These were the same arguments made a
> while back for a new way of cloning files on XFS. We solved those
> problems just with a small change to the locking, and didn't need
> new ioctls or lots of new code just to solve the "clone blocks
> concurrent IO" problem.
> 
> I'm looking at this from exactly the same POV. The code presented is
> doing lots of complex, unusable stuff to work around the fact that
> UNSHARE blocks concurrent IO. I don't see any difference between
> CLONE and UNSHARE from the IO perspective - if anything, UNSHARE can
> have looser rules than CLONE, because a concurrent write will either
> do the COW of a shared block itself, or hit the exclusive block that
> has already been unshared.
> 
> So if we fix these locking issues in the kernel, then the whole need
> for working around the IO concurrency problems with UNSHARE goes
> away and the userspace code becomes much, much simpler.
> 
>>> Indeed, if you used FIEMAP, you can pass a minimum
>>> segment length to filter out all the small extents. Iterating that
>>> extent list means all the ranges you need to defrag are in the holes
>>> of the returned mapping information. This would be much faster
>>> than an entire linear mapping to find all the regions with small
>>> extents that need defrag. The second step could then be doing a
>>> fine grained mapping of each region that we now know either contains
>>> fragmented data or holes....

Where can we pass a minimum segment length to filter things out? I
don't see that in the fiemap structure. A fiemap request is encoded
within struct fiemap:

struct fiemap {
	__u64 fm_start;		/* logical offset (inclusive) at
				 * which to start mapping (in) */
	__u64 fm_length;	/* logical length of mapping which
				 * userspace cares about (in) */
	__u32 fm_flags;		/* FIEMAP_FLAG_* flags for request (in/out) */
	__u32 fm_mapped_extents; /* number of extents that were
				  * mapped (out) */
	__u32 fm_extent_count;	/* size of fm_extents array (in) */
	__u32 fm_reserved;
	struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */
};

Thanks,
Wengang

>>
>> Hm… just a question here:
>> In your scheme, say you set the filter length to 2048; all extents
>> of 2048 blocks or fewer are to be defragmented.
>> What if the extent layout (in blocks) is like this:
>>
>> 1. 1
>> 2. 2049
>> 3. 2
>> 4. 2050
>> 5. 1
>> 6. 2051
>>
>> In the above case, do you defrag or not?
> 
> The filtering presented in the patch above will not defrag any of
> this with a 2048 block segment size, because the second extent in
> each segment extends beyond the configured max segment length. IOWs,
> it ends up with a single extent per "2048 block segment", and that
> won't get defragged with the current algorithm.
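[To make that decision concrete: a rough sketch of a "segment length
<= limit" filter as described above. This is illustrative only, not
code from the patch set; segment_needs_defrag() and the greedy
segment builder are assumed names and structure.]

#include <stdbool.h>

/*
 * Sketch: build a segment greedily from consecutive extents, stopping
 * before the extent that would push the segment past the limit. A
 * segment holding a single extent is already contiguous, so only a
 * multi-extent segment is a defrag candidate.
 */
static bool segment_needs_defrag(const unsigned long long *ext_blocks,
				 int nr_exts, unsigned long long limit,
				 int *next)
{
	unsigned long long used = 0;
	int i, count = 0;

	for (i = *next; i < nr_exts; i++) {
		if (used + ext_blocks[i] > limit)
			break;		/* next extent won't fit */
		used += ext_blocks[i];
		count++;
	}
	if (count == 0)
		i++;		/* lone oversized extent: skip it */
	*next = i;
	return count > 1;	/* single extent: nothing to merge */
}

With the layout above (1, 2049, 2, 2050, 1, 2051) and limit = 2048,
each call takes in just one small extent before the following large
extent overflows the limit, so count never exceeds 1 and no segment
gets defragged, matching the conclusion above.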
> 
> As it is, this really isn't a common fragmentation pattern for a
> file that does not contain shared extents, so I wouldn't expect to
> ever need to decide if this needs to be defragged or not.
> 
> However, it is exactly the layout I would expect to see for cloned
> and modified filesystem image files. That is, the common layout for
> such a "cloned from golden image" VM image is this:
> 
> 1. 1    written
> 2. 2049 shared
> 3. 2    written
> 4. 2050 shared
> 5. 1    written
> 6. 2051 shared
> 
> i.e. there are large chunks of contiguous shared extents between the
> small individual COW block modifications that have been made to
> customise the image for the deployed VM.
> 
> Either way, if the segment/filter length is 2048 blocks, then this
> isn't a pattern that should be defragmented. If the segment/filter
> length is 4096 or larger, then yes, this pattern should definitely
> be defragmented.
> 
>> As I understand the situation, the performance of defrag itself is
>> not a critical concern here.
> 
> Sure, but implementing a low-performing, high-CPU-consumption,
> entirely single-threaded defragmentation model that requires
> specific tuning for every different environment it is run in doesn't
> seem like the best idea to me.
> 
> I'm trying to work out if there is a faster, simpler way of
> achieving the same goal....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
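[A rough illustration of the FIEMAP-based scan discussed in the
thread. struct fiemap itself carries no minimum-length field, so this
sketch applies the length filter in userspace over the returned
extents; scan_small_extents() is an invented name, not part of any
patch here.]

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

/* Report extents shorter than min_bytes as defrag candidates. */
static int scan_small_extents(int fd, __u64 min_bytes)
{
	struct {
		struct fiemap fm;
		struct fiemap_extent ext[128];
	} req;
	struct fiemap *fm = &req.fm;
	__u64 start = 0;
	int last = 0;

	while (!last) {
		unsigned int i;

		memset(&req, 0, sizeof(req));
		fm->fm_start = start;
		fm->fm_length = ~0ULL;		/* map through EOF */
		fm->fm_flags = FIEMAP_FLAG_SYNC;
		fm->fm_extent_count = 128;

		if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
			return -1;
		if (fm->fm_mapped_extents == 0)
			break;

		for (i = 0; i < fm->fm_mapped_extents; i++) {
			struct fiemap_extent *fe = &fm->fm_extents[i];

			if (fe->fe_length < min_bytes)
				printf("candidate: %llu+%llu\n",
				       (unsigned long long)fe->fe_logical,
				       (unsigned long long)fe->fe_length);
			if (fe->fe_flags & FIEMAP_EXTENT_LAST)
				last = 1;
			start = fe->fe_logical + fe->fe_length;
		}
	}
	return 0;
}

A second, fine-grained mapping pass over each candidate range (and
the holes between returned mappings) would then drive the actual
defrag, per the two-step approach suggested earlier in the thread.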