> On Jul 30, 2024, at 7:51 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Tue, Jul 16, 2024 at 07:45:37PM +0000, Wengang Wang wrote:
>>
>>> On Jul 15, 2024, at 4:03 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>>
>>> [ Please keep documentation text to 80 columns. ]
>>
>> Yes. This is not a patch. I copied it from the man 8 output.
>> It will be limited to 80 columns when sent as a patch.
>>
>>> [ Please run documentation through a spell checker - there are too
>>> many typos in this document to point them all out... ]
>>
>> OK.
>>
>>> On Tue, Jul 09, 2024 at 12:10:19PM -0700, Wengang Wang wrote:
>>>> This patch set introduces a defrag command to xfs_spaceman. It has
>>>> the functionality and features below (also to be added to the man
>>>> page, so please review):
>>>
>>> What's the use case for this?
>>
>> This is the user space defrag as you suggested previously.
>>
>> Please see the previous conversation for your reference:
>> https://patchwork.kernel.org/project/xfs/cover/20231214170530.8664-1-wen.gang.wang@xxxxxxxxxx/
>
> That's exactly what you should have put in the cover letter!
>
> The cover letter is not for documenting the user interface of a new
> tool - that's what the patch in the patch set for the new man page
> should be doing.
>
> The cover letter should contain references to past patch sets and
> discussions on the topic. The cover letter should also contain a
> changelog that documents what is different in this new version of
> the patch set so reviewers know what you've changed since they last
> looked at it.
>
> IOWs, the cover letter is for explaining the use case, why the
> functionality is needed, important design/implementation decisions
> and the history of the patch set. It's meant to inform and remind
> readers of what has already happened to get to this point.
>
>> COPY STARTS —————————————>
>> I am copying your last comment there:
>>
>> On Tue, Dec 19, 2023 at 09:17:31PM +0000, Wengang Wang wrote:
>>> Hi Dave,
>>> Yes, the user space defrag works and satisfies my requirement
>>> (almost no change from your example code).
>>
>> That's good to know :)
>>
>>> Let me know if you want it in xfsprogs.
>>
>> Yes, I think adding it as an xfs_spaceman command would be a good
>> way for this defrag feature to be maintained for anyone who has need
>> for it.
>
> Sure, I might have said that 6 months ago. When presented with a
> completely new implementation in a new context months later, I might
> see things differently. Everyone is allowed to change their mind,
> opinions and theories as circumstances, evidence and contexts
> change.
>
> Indeed, when I look at this:
>
>>>> defrag [-f free_space] [-i idle_time] [-s segment_size] [-n] [-a]
>>>> defrag defragments the specified XFS file online non-exclusively.
>>>> The target XFS
>
> I didn't expect anything nearly as complex and baroque as this. All
> I was expecting was something like this to defrag a single range of
> a file:
>
> xfs_spaceman -c "defrag <offset> <length>" <file>
>
> as the control command, and then functionality for
> automated/periodic scanning and defrag would still end up being
> co-ordinated by the existing xfs_fsr code.
>
>>> What's "non-exclusively" mean? How is this different to what xfs_fsr
>>> does?
>>
>> I think you will have seen the difference after reviewing more of
>> this set. Well, if I read the xfs_fsr code correctly, though xfs_fsr
>> allows parallel writes, it looks to have a problem(?)
>> As I read the code, xfs_fsr does the following to defrag one file:
>> 1) preallocate blocks to a temporary file, hoping the temporary file
>> gets the same number of blocks as the file under defrag but with
>> fewer extents.
>> 2) copy data blocks from the file under defrag to the temporary file.
>> 3) switch the extents between the two files.
>>
>> For stage 2, it's NOT copying data blocks in an atomic manner. Take
>> an example: two read->write pairs are needed to complete the data
>> copy, that is
>> Copy range 1 (read range 1 from the file under defrag, write it to
>> the temporary file)
>> Copy range 2
>
> I wasn't asking you to explain to me how the xfs_fsr algorithm
> works. What I was asking for was a definition of what
> "non-exclusively" means.
>
> What xfs_fsr currently does meets my definition of "non-exclusive" -
> it does not rely on or require exclusive access to the file being
> defragmented except for the atomic extent swap at the end. However,
> using FICLONE/UNSHARE does require exclusive access to the file being
> defragmented for the entirety of those operations, so I don't have
> any real idea of why this new algorithm is explicitly described as
> "non-exclusive".
>
> Defining terms so everyone has a common understanding is important.
>
> Indeed, given that we now have XFS_IOC_EXCHANGE_RANGE, I'm
> definitely starting to wonder if clone/unshare is actually the best
> way to do this now. I think we could make xfs_fsr do iterative
> small file region defrag using XFS_IOC_EXCHANGE_RANGE instead of
> 'whole file at once' as it does now. If we were also to make fsr
> aware of shared extents ...
>
>>>> Defragmentation and file IOs
>>>>
>>>> The target file is virtually divided into many small segments.
>>>> Segments are the smallest units for defragmentation. Each segment
>>>> is defragmented one by one in a lock->defragment->unlock->idle
>>>> manner.
>>>
>>> Userspace can't easily lock the file to prevent concurrent access.
>>> So I'm not sure what you are referring to here.
>>
>> The manner does not simply mean what is done in user space; it
>> covers the whole flow in both user space and kernel space. The tool
>> defrags a file segment by segment. The lock->defragment->unlock
>> is done by the kernel in response to the FALLOC_FL_UNSHARE_RANGE
>> request from user space.
>
> I'm still not sure what locking you are trying to describe. There
> are multiple layers of locking in the kernel, and we use them
> differently. Indeed, the algorithm you have described is actually
>
> FICLONERANGE
>   IOLOCK shared
>   ILOCK exclusive
>   remap_file_range()
>   IUNLOCK exclusive
>   IOUNLOCK shared
>
> .....
>
> UNSHARE_RANGE
>   IOLOCK exclusive
>   MMAPLOCK exclusive
>   <drain DIO in flight>
>   ILOCK exclusive
>   unshare_range()
>   IUNLOCK exclusive
>   MMAPUNLOCK exclusive
>   IOUNLOCK shared
>
> And so there isn't a single "lock -> defrag -> unlock" context
> occurring - there are multiple independent operations that have
> different kernel side locking contexts and there are no userspace
> side file locking contexts, either.
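For concreteness, here is a minimal sketch of the user-space side of
that clone-then-unshare sequence. The function name and temp-file
scheme are illustrative rather than the patch set's actual code; it
assumes the FICLONERANGE and FALLOC_FL_UNSHARE_RANGE definitions from
linux/fs.h and linux/falloc.h, and filesystem-block-aligned offsets:

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <sys/ioctl.h>
        #include <linux/fs.h>
        #include <linux/falloc.h>

        /* Sketch: one defrag step for a single segment. The clone pins
         * the segment's current blocks in a temp file; the unshare then
         * forces the target file to copy the (now shared) segment into
         * a new, hopefully contiguous, allocation. */
        static int defrag_segment(int target_fd, int temp_fd,
                                  off_t off, off_t len)
        {
                struct file_clone_range fcr = {
                        .src_fd = target_fd,    /* file under defrag */
                        .src_offset = (__u64)off,
                        .src_length = (__u64)len,
                        .dest_offset = (__u64)off,
                };

                /* FICLONERANGE: IOLOCK shared, ILOCK exclusive above. */
                if (ioctl(temp_fd, FICLONERANGE, &fcr) < 0)
                        return -1;

                /* UNSHARE: IOLOCK/MMAPLOCK exclusive, DIO drained. */
                return fallocate(target_fd, FALLOC_FL_UNSHARE_RANGE,
                                 off, len);
        }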
>>>> File IOs are blocked when the target file is locked and are
>>>> served during the defragmentation idle time (file is unlocked).
>>>
>>> What file IOs are being served in parallel? The defragmentation IO?
>>> something else?
>>
>> Here the file IOs mean the IO requests from user space applications,
>> including the virtual machine engine.
>>
>>>> Though the file IOs can't really go in parallel, they are not
>>>> blocked long. The locking time basically depends on the segment
>>>> size. Smaller segments usually take less locking time and thus IOs
>>>> are blocked for a shorter time; bigger segments usually need more
>>>> locking time and IOs are blocked longer. Check the -s and -i
>>>> options to balance defragmentation and IO service.
>>>
>>> How is a user supposed to know what the correct values are for their
>>> storage, files, and workload? Algorithms should auto-tune, not
>>> require users and administrators to use trial and error to find the
>>> best numbers to feed a given operation.
>>
>> In my opinion, users need a way to control this according to their
>> use case. Any algorithm will restrict what the user wants to do.
>> Say, a user wants the defrag done as quickly as possible, regardless
>> of the resources it takes (CPU, IO and so on), when the production
>> system is in a maintenance window. But when the production system is
>> busy, the user wants the defrag to use fewer resources.
>
> That's not for the defrag program to implement. That's what we use
> resource control groups for. Things like memcgs, block IO cgroups,
> scheduler cgroups, etc. Administrators are used to restricting the
> resources used by applications with generic admin tools; asking them
> to learn how some random admin tool does its own resource
> utilisation restriction that requires careful hand tuning for
> -one-off admin events- is not the right way to solve this problem.
>
> We should be making the admin tool go as fast as possible and
> consume as much resources as are available. This makes it fast out
> of the box, and lets the admins restrict the IO rate, CPU and memory
> usage to bring it down to an acceptable resource usage level for
> admin tasks on their systems.
>
>> Another example: the kernel (algorithms) never knows the maximum IO
>> latency the user applications tolerate. But if you have some
>> algorithms, please share.
>
> As I said - make it as fast and low latency as reasonably possible.
> If you have less than 10ms IO latency SLAs, the application isn't
> going to be running on sparse, software defined storage that may
> require hundreds of milliseconds of IO pauses during admin tasks.
> Hence design to a max fixed IO latency (say 100ms) and make the
> functionality run as fast as possible within that latency window.
>
> If people need lower latency SLAs, then they shouldn't be running
> that application on sparse, COW based VM images. This is not a
> problem a defrag utility should be trying to solve.
>
>>>> Free blocks consumption
>>>>
>>>> Defragmentation works by (trying to) allocate new (contiguous)
>>>> blocks, copying data and then freeing the old (non-contiguous)
>>>> blocks. Usually the number of old blocks to free equals the number
>>>> of newly allocated blocks. As a final result, defragmentation
>>>> doesn't consume free blocks. Well, that is true if the target file
>>>> is not sharing blocks with other files.
>>>
>>> This is really hard to read. Defragmentation will -always- consume
>>> free space while it is in progress. It will always release the
>>> temporary space it consumes when it completes.
>>
>> I don't think it always frees blocks when it releases the temporary
>> file. When the blocks were originally shared before defrag, the
>> blocks won't be freed.
>
> I didn't make myself clear. If the blocks shared to the temp file
> are owned exclusively by the source file (i.e. they were COW'd from
> shared extents at some time in the past), then that is space
> that is temporarily required by the defragmentation process. UNSHARE
> creates a second, permanent copy of those blocks in the source file
> and closing of the temp file then makes the original exclusively
> owned blocks go away.
>
> IOWs, defrag can temporarily consume an entire extra file's worth of
> space between the UNSHARE starting and the freeing of the temporary
> file when we are done with it. Freeing the temp file -always-
> releases this extra space, though I note that the implementation is
> to hole-punch it away after each segment has been processed.
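A sketch of that per-segment reclaim, under the same assumptions as
the earlier sketch (illustrative name, not the patch set's code):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <linux/falloc.h>

        /* Sketch: once a segment has been unshared in the target file,
         * punch the temp file's copy away so the old blocks return to
         * the free space pool (unless other files still share them). */
        static int release_old_blocks(int temp_fd, off_t off, off_t len)
        {
                return fallocate(temp_fd,
                                 FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                                 off, len);
        }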
>>>> In case the target file contains shared blocks, those shared
>>>> blocks won't be freed back to the filesystem as they are still
>>>> owned by other files. So defragmentation allocates more blocks
>>>> than it frees.
>>>
>>> So this is doing an unshare operation as well as defrag? That seems
>>> ... suboptimal. The whole point of sharing blocks is to minimise
>>> disk usage for duplicated data.
>>
>> That depends on the user's need. If users think defrag is the first
>> priority, it is. If users don't think the disk saving is the most
>> important, it is not. No matter what developers think.
>                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> That's pretty ... dismissive.
>
> I mean, you're flat out wrong. You make the assumption that a user
> knows exactly how every file that every application in their system
> has been created and knows exactly how best to defragment it.
>
> That's just .... wrong.
>
> Users and admins do not have intimate knowledge of how their
> applications do their stuff, and a lot of them don't even know
> that their systems are using file clones (i.e. reflink copies)
> instead of data copies extensively these days.
>
> That is completely the wrong way to approach administration
> tools.
>
> Our defragmentation policy for xfs_fsr is to leave the structure of
> the file as intact as possible. That means we replicate unwritten
> regions in the defragmented file. We actually -defragment unwritten
> extents- in xfs_fsr, not just written extents, and we do that
> because we have to assume that the unwritten extents exist for a
> good reason.
>
> We don't expect the admin to make a decision as to whether unwritten
> extents should be replicated or defragged - we make the assumption
> that either the application or the admin has asked for them to exist
> in the first place.
>
> It is similar for defragmenting files that are largely made up of
> shared extents. That layout exists for a reason, and it's not the
> place of the defragmentation operation to unilaterally decide layout
> policies for the admin and/or application that is using files with
> shared extents.
>
> Hence the defrag operation must preserve the *observed intent* of
> the source file layout as much as possible and not require the admin
> or user to be sufficiently informed to make the right decision one
> way or another. We must attempt to preserve the status quo.
>
> Hence if the file is largely shared, we must not unshare the entire
> file to defragment it unless that is the only way to reduce the
> fragmentation (e.g. resolve small interleaved shared and unshared
> extents). If there are reasonable sized shared extents, we should be
> leaving them alone and not unsharing them just to reduce the extent
> count by a handful of extents.
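Such a policy could plausibly be driven from FIEMAP extent flags. A
sketch follows; the helper names and the 16MiB cutoff are hypothetical
policy choices, not anything from the patch set:

        #include <sys/ioctl.h>
        #include <linux/fs.h>
        #include <linux/fiemap.h>
        #include <stdlib.h>

        /* Sketch: fetch up to 'count' extent records for a file. */
        static struct fiemap *map_extents(int fd, __u64 len, __u32 count)
        {
                struct fiemap *fm = calloc(1, sizeof(*fm) +
                                count * sizeof(struct fiemap_extent));

                if (!fm)
                        return NULL;
                fm->fm_length = len;
                fm->fm_extent_count = count;
                if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
                        free(fm);
                        return NULL;
                }
                return fm;
        }

        /* Illustrative policy: preserve unwritten extents, and leave
         * any reasonably sized shared extent alone rather than unshare
         * it just to shave a few extents off the count. */
        #define KEEP_SHARED_MIN (16ULL << 20)   /* hypothetical cutoff */

        static int leave_extent_alone(const struct fiemap_extent *fe)
        {
                if (fe->fe_flags & FIEMAP_EXTENT_UNWRITTEN)
                        return 1;
                return (fe->fe_flags & FIEMAP_EXTENT_SHARED) &&
                       fe->fe_length >= KEEP_SHARED_MIN;
        }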
>> What's more, reflink (or sharing blocks) is not only used to
>> minimise disk usage. Sometimes it's used as a way to take snapshots,
>> and those snapshots might not stay long.
>
> Yes, I know this. It doesn't change anything to do with how we
> defragment a file that contains shared blocks.
>
> If you don't want the snapshot(s) to affect defragmentation, then
> don't run defrag while the snapshots are present. Otherwise, we
> want defrag to retain as much sharing between the snapshots and
> the source file because *minimising the space used by snapshots* is
> the whole point of using file clones for snapshots in the first
> place!
>
>> And what's more is that the unshare operation is what you
>> suggested :D
>
> I suggested it as a mechanism to defrag regions of shared files with
> excessive fragmentation. I was not suggesting that "defrag ==
> unshare".
>
>>>> For existing XFS, free blocks might be over-committed when reflink
>>>> snapshots were created. To avoid causing the XFS filesystem to run
>>>> into a low free blocks state, this defragmentation excludes
>>>> (partially) shared segments when the filesystem's free blocks
>>>> reach a threshold. Check the -f option.
>>>
>>> Again, how is the user supposed to know when they need to do this?
>>> If the answer is "they should always avoid defrag on low free
>>> space", then why is this an option?
>>
>> I didn't say "they should always avoid defrag on low free space".
>> And even so, we can't say how low is not tolerated by the user; that
>> depends on the user's use case. Though it's an option, it has a
>> default value of 1GB. If users don't set this option, that is
>> "always avoid defrag on low free space".
>
> You didn't answer my question: how is the user supposed to know
> when they should set this?
>
> And, again, the followup question is: why does this need to be
> built into the defrag tool?
>
> From a policy perspective, caring about the amount of free space in
> the filesystem isn't the job of a defragmentation operation. It
> should simply abort if it gets an ENOSPC error or fails to improve
> the layout of the file in question. Indeed, if it is obvious that
> there may not be enough free space in the filesystem to begin with,
> then don't run the defrag operation at all.
>
> This is how xfs_fsr works - it tries to preallocate all the space it
> will need before it starts moving data. If it fails to preallocate
> all the space, it aborts. If it fails to find large enough
> contiguous free spaces to improve the layout of the file, it aborts.
>
> IOWs, xfs_fsr policy is that it doesn't care about the amount of
> free space in the filesystem, it just cares if the result will
> improve the layout of the file. That's basically how any online
> background defrag operation should work - if the new
> layout is worse than the existing layout, or there isn't space for
> the new layout to be allocated, just abort.
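That abort-if-not-better policy can be expressed with FIEMAP's
count-only mode; again a hedged sketch with illustrative names, not
xfs_fsr's actual code:

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <string.h>
        #include <sys/ioctl.h>
        #include <linux/fs.h>
        #include <linux/fiemap.h>

        /* With fm_extent_count == 0, FS_IOC_FIEMAP returns no extent
         * records and just reports, in fm_mapped_extents, how many
         * extents the range maps to. */
        static long count_extents(int fd, __u64 len)
        {
                struct fiemap fm;

                memset(&fm, 0, sizeof(fm));
                fm.fm_length = len;
                if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0)
                        return -1;
                return fm.fm_mapped_extents;
        }

        /* Sketch: grab all the space up front; bail on ENOSPC or if
         * the new layout would not actually beat the old one. */
        static int worth_defragging(int src_fd, int tmp_fd, off_t size)
        {
                if (fallocate(tmp_fd, 0, 0, size) < 0)
                        return 0;       /* can't get the space: abort */
                return count_extents(tmp_fd, (__u64)size) <
                       count_extents(src_fd, (__u64)size);
        }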
>>>> Safety and consistency
>>>>
>>>> The defragmentation file is guaranteed safe and data consistent
>>>> across ctrl-c and kernel crashes.
>>>
>>> Which file is the "defragmentation file"? The source or the temp
>>> file?
>>
>> I don't think there is a "source" concept here. There is no data
>> copy between files. "The defragmentation file" means the file under
>> defrag; I will change it to "the file under defrag". I don't think
>> users care about the temporary file at all.
>
> Define the terms you use rather than assuming the reader
> understands both the terminology you are using and the context in
> which you are using them.
>
> .....
>
>>>> The command takes the following options:
>>>> -f free_space
>>>> The threshold of XFS free blocks in MiB. When free blocks are less
>>>> than this number, (partially) shared segments are excluded from
>>>> defragmentation. Default number is 1024.
>>>
>>> When you are down to 4MB of free space in the filesystem, you
>>> shouldn't even be trying to run defrag because all the free space
>>> that will be left in the filesystem is single blocks. I would have
>>> expected this sort of number to be a percentage of capacity,
>>> defaulting to something like 5% (which is where we start running low
>>> space algorithms in the kernel).
>>
>> I would like to leave this to the user.
>
> Again: How is the user going to know what to set this to? What
> problem is this avoiding that requires the user to change this in
> any way?
>
>> When a user is doing defrag on a low free space system, it won't
>> cause problems to the filesystem itself. At most the defrag fails
>> during unshare when allocating blocks.
>
> Why would we even allow a user to run defrag near ENOSPC? It is a
> well known problem that finding contiguous free space when we are
> close to ENOSPC is difficult, and so defrag often is unable to
> improve the situation when we are within a few percent of the
> filesystem being full.
>
> It is also a well known problem that defragmentation at low free
> space trades off contiguous free space for fragmented free space.
> Hence when we are at low free space, defrag makes the free space
> fragmentation worse, which then results in all allocation in the
> filesystem getting worse and more fragmented. This is something we
> absolutely should be trying to avoid.
>
> This is one of the reasons xfs_fsr tries to lay out the entire
> file before doing any IO - when about 95% full, it's common for the
> new layout to be worse than the original file's layout because there
> isn't sufficient contiguous free space to improve the layout.
>
> IOWs, running defragmentation when we are above 95% full is actively
> harmful to the longevity of the filesystem. Hence, on a fundamental
> level, having a low space threshold in a defragmentation tool is
> simply wrong - defragmentation should simply not be run when the
> filesystem is anywhere near full.
>
> .....
>
>>>> -s segment_size
>>>> The size limitation in bytes of segments. Minimum number is 4MiB,
>>>> default number is 16MiB.
>>>
>>> Why were these numbers chosen? What happens if the file has ~32MB
>>> sized extents and the user wants the file to be returned to a single
>>> large contiguous extent, if possible? i.e. how is the user supposed
>>> to know how to set this for any given file without first having
>>> examined the exact pattern of fragmentation in the file?
>>
>> Why would customers want the file to be returned to a single large
>> contiguous extent? A 32MB extent is pretty good to me. I didn't hear
>> any customer complain about 32MB extents...
>
> There's a much wider world out there than just Oracle customers.
> Just because you aren't aware of other use cases that exist, it
> doesn't mean they don't exist. I know they exist, hence my question.
>
> For example, extent size hints are used to guarantee that the data
> is aligned to the underlying storage correctly, and very large
> contiguous extents are required to avoid excessive seeks during
> sequential reads that result in critical SLA failures. Hence if a
> file is poorly laid out in this situation, defrag needs to return it
> to as few, maximally sized extents as it can. How does a user know
> what they'd need to set this segment size field to and so achieve
> the result they need?
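As an illustration of deriving a per-file target without asking the
user, a tool could consult the file's extent size hint. A sketch,
assuming the generic FS_IOC_FSGETXATTR interface from linux/fs.h;
the function name is illustrative:

        #include <sys/ioctl.h>
        #include <linux/fs.h>

        /* Sketch: use the file's extent size hint, if set, as the
         * defrag target size; otherwise aim for "as few, maximally
         * sized extents as possible" rather than a fixed segment. */
        static unsigned long long defrag_target_bytes(int fd)
        {
                struct fsxattr fsx;

                if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) == 0 &&
                    fsx.fsx_extsize)
                        return fsx.fsx_extsize; /* hint is in bytes */
                return 0;       /* 0: no hint, aim for maximal extents */
        }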
>
>> And you know, whether we can defrag extents into a large one depends
>> not only on the tool itself. It depends on the status of the
>> filesystem too - say, whether the filesystem is very fragmented, or
>> the AG size.
>>
>> The 16MB was selected according to our tests based on a customer
>> metadump. With a 16MB segment size, the defrag result is very good
>> and the IO latency is acceptable too. With the default 16MB segment
>> size, a 32MB extent is excluded from defrag.
>
> Exactly my point: you have written a solution that works for a
> single filesystem in a single environment. However, the solution is
> so specific to the single problem you need to solve that it is not
> clear whether that functionality or defaults are valid outside of
> the specific problem case you've written it for and tested it on.
>
>> If you have a better default size, we can use that.
>
> I'm not convinced that fixed size "segments" is even the right way
> to approach this problem. What needs to be done is dependent on the
> extent layout of the file, not how extents fit over some arbitrary
> fixed segment map....
>
>>>> We tested with a real customer metadump with some different
>>>> 'idle_time's and found 250ms is a good practical sleep time. Here
>>>> are some numbers from the test:
>>>>
>>>> Test: running defrag on an image file which is used as the back
>>>> end of a block device in a virtual machine. At the same time, fio
>>>> is running inside the virtual machine on that block device.
>>>> block device type: NVME
>>>> File size: 200GiB
>>>> parameters to defrag: free_space: 1024 idle_time: 250
>>>> First_extent_share: enabled readahead: disabled
>>>> Defrag run time: 223 minutes
>>>> Number of extents: 6745489 (before) -> 203571 (after)
>>>
>>> So an average extent size of ~32kB before, 100MB after? How much of
>>> these are shared extents?
>>
>> Zero shared extents, but there are some unwritten ones.
>> Stats from a similar run look like this:
>> Pre-defrag 6654460 extents detected, 112228 are "unwritten", 0 are
>> "shared"
>> Tried to defragment 6393352 extents (181000359936 bytes) in 26032
>> segments
>> Time stats(ms): max clone: 31, max unshare: 300, max punch_hole: 66
>> Post-defrag 282659 extents detected
>>
>>> Runtime is 13380secs, so if we copied 200GiB in that time, the
>>> defrag ran at 16MB/s. That's not very fast.
>>
>> We are chasing the balance of defrag and parallel IO latency.
>
> My point is that stuff like CLONE and UNSHARE should be able to run
> much, much faster than this, even if some of the time is left idle
> for other IO.
>
> i.e. we can clone extents at about 100,000/s. We can copy data
> through the page cache at 7-8GB/s on NVMe devices.
>
> A full clone of the 6.6 million extents should only take about
> a minute.
>
> A full page cache copy of the 200GB cloned file (i.e. via read/write
> syscalls) should easily run at >1GB/s, and so only take a couple of
> minutes to run.
>
> IOWs, the actual IO and metadata modification side of things is
> really only about 5 minutes worth of CPU and IO.
>
> Hence this defrag operation is roughly 100x slower than we should be
> able to run it at. We should be able to run it at close to those
> speeds whilst still allowing concurrent read access to the file.
>
> If an admin then wants it to run at 16MB/s, it can be throttled
> to that speed using cgroups, ionice, etc.
>
> i.e. I think you are trying to solve too many unnecessary problems
> here and not addressing the one thing it should do: defrag a file as
> fast and efficiently as possible.
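A sketch of what such layout-driven work selection might look like,
walking a FIEMAP extent array as in the earlier sketches; the 16MiB
cutoff is again purely illustrative:

        #include <linux/fiemap.h>

        #define ALREADY_LARGE   (16ULL << 20)   /* illustrative cutoff */

        /* Sketch: derive the next defrag work unit from the extent
         * layout itself. Returns the byte length of the next run of
         * small adjacent extents, advancing *i past it; already-large
         * extents are skipped rather than forced into fixed-size
         * segments. */
        static __u64 next_defrag_run(const struct fiemap_extent *fe,
                                     unsigned int nr, unsigned int *i)
        {
                __u64 run = 0;

                while (*i < nr && fe[*i].fe_length < ALREADY_LARGE)
                        run += fe[(*i)++].fe_length;
                if (!run && *i < nr)
                        (*i)++;         /* leave the large extent alone */
                return run;
        }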
>
Thanks for all the above replies. For the performance, I am still
thinking that the bottleneck is the synchronous page-by-page disk
reading. Yes, I have to address/work around something in the kernel.
Let's expect the related kernel fixes and then simplify the user
space defrag code.
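For illustration, one user-space mitigation until then is to warm the
page cache for each segment before unsharing it, along the lines of
the patch set's "readahead" option (a sketch, not its actual
implementation; chunk size is arbitrary):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <unistd.h>

        /* Sketch: pre-populate the page cache for a segment so the
         * kernel's unshare copy finds the data already resident
         * instead of issuing synchronous page-by-page reads. */
        static void prefetch_segment(int fd, off_t off, off_t len)
        {
                static char buf[1 << 20];       /* 1MiB chunks */
                off_t end = off + len;

                posix_fadvise(fd, off, len, POSIX_FADV_WILLNEED);
                while (off < end) {
                        ssize_t n = pread(fd, buf, sizeof(buf), off);
                        if (n <= 0)
                                break;
                        off += n;
                }
        }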