Re: [PATCH 0/9] introduce defrag to xfs_spaceman

> On Jul 15, 2024, at 4:03 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> 
> [ Please keep documentation text to 80 columns. ] 
> 

Yes. This is not a patch; I copied it from the man(8) output.
It will be limited to 80 columns when sent as a patch.

> [ Please run documentation through a spell checker - there are too
> many typos in this document to point them all out... ]

OK.

> 
> On Tue, Jul 09, 2024 at 12:10:19PM -0700, Wengang Wang wrote:
>> This patch set introduces defrag to the xfs_spaceman command. It has the functionality and
>> features below (also subject to being added to the man page, so please review):
> 
> What's the use case for this?

This is the user space defrag as you suggested previously.

Please see the previous conversation for your reference: 
https://patchwork.kernel.org/project/xfs/cover/20231214170530.8664-1-wen.gang.wang@xxxxxxxxxx/

I am copying your last comment from there:
COPY STARTS —————————————>

On Tue, Dec 19, 2023 at 09:17:31PM +0000, Wengang Wang wrote:
> Hi Dave,
> Yes, the user space defrag works and satisfies my requirement (almost no change from your example code).

That's good to know :)

> Let me know if you want it in xfsprog.

Yes, i think adding it as an xfs_spaceman command would be a good
way for this defrag feature to be maintained for anyone who has need
for it.

-Dave.
<———————————————— COPY ENDS

> 
>>       defrag [-f free_space] [-i idle_time] [-s segment_size] [-n] [-a]
>>              defrag defragments the specified XFS file online non-exclusively. The target XFS
> 
> What's "non-exclusively" mean? How is this different to what xfs_fsr
> does?
> 

I think you will have seen the difference once you review more of this set.
Well, if I read the xfs_fsr code correctly, then although xfs_fsr allows parallel writes, it looks like it has a problem(?).
As I read the code, xfs_fsr does the following to defrag one file:
1) preallocate blocks to a temporary file, hoping the temporary file gets the same number of blocks
    as the file under defrag but with fewer extents.
2) copy data blocks from the file under defrag to the temporary file.
3) switch the extents between the two files.

Stage 2 does not copy the data blocks in an atomic manner. Take an example where two
read->write pairs are needed to complete the data copy, that is:
    Copy range 1 (read range 1 from the file under defrag, write it to the temporary file)
    Copy range 2

If a new write to range 1 of the file under defrag arrives after range 1 has been copied, is that
new write lost once the defrag (xfs_fsr) finishes?

I didn't look into the extent-switch code, so I don't know whether it checks that the two files have
the same data contents. But even if it does, it would be pretty slow with the file locked.
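
For reference, the extent switch in step 3 is the XFS_IOC_SWAPEXT ioctl. A rough sketch from
my reading of xfs_fsr (error handling and the bulkstat setup omitted; whether the kernel rejects
the swap when the target changed after the sx_stat snapshot is exactly my question):

    /* Sketch only: swap extents between the defrag target and the temp
     * file. sx_stat is a bulkstat snapshot of the target taken before
     * the data copy. */
    #include <sys/ioctl.h>
    #include <xfs/xfs.h>

    static int swap_extents(int target_fd, int tmp_fd, struct xfs_bstat *bstat)
    {
            struct xfs_swapext sx = {
                    .sx_version  = XFS_SX_VERSION,
                    .sx_fdtarget = target_fd,
                    .sx_fdtmp    = tmp_fd,
                    .sx_offset   = 0,
                    .sx_length   = bstat->bs_size,
                    .sx_stat     = *bstat,
            };

            return ioctl(target_fd, XFS_IOC_SWAPEXT, &sx);
    }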


>>              doesn't have to (and must not) be unmounted.  When defragmentation is in progress, file
>>              IOs are served 'in parallel'.  The reflink feature must be enabled in the XFS.
> 
> xfs_fsr allows IO to occur in parallel to defrag.

Please see my concern above.

> 
>>              Defragmentation and file IOs
>> 
>>              The target file is virtually divided into many small segments. Segments are the
>>              smallest units for defragmentation. Each segment is defragmented one by one in a
>>              lock->defragment->unlock->idle manner.
> 
> Userspace can't easily lock the file to prevent concurrent access.
> So I'm not sure what you are referring to here.

The manner doesn't refer only to what is done in user space, but to the whole sequence across
user space and kernel space. The tool defrags a file segment by segment; the lock->defragment->unlock
is done by the kernel in response to the FALLOC_FL_UNSHARE_RANGE request from user space.
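
To illustrate, a minimal sketch of the per-segment loop as seen from user space (names and
parameters are illustrative, not the actual patch code; it assumes the segment was already made
shared, e.g. by cloning it into the temporary file, so the unshare has work to do):

    /* Sketch only: inside fallocate() the kernel takes the IOLOCK,
     * allocates new contiguous blocks, copies the data and unlocks;
     * the sleep is the idle window in which blocked file IO is served. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>
    #include <unistd.h>

    static int defrag_one_segment(int fd, off_t offset, off_t length,
                                  unsigned int idle_ms)
    {
            if (fallocate(fd, FALLOC_FL_UNSHARE_RANGE, offset, length) < 0)
                    return -1;
            usleep(idle_ms * 1000);
            return 0;
    }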

> 
>>              File IOs are blocked when the target file is locked and are served during the
>>              defragmentation idle time (file is unlocked).
> 
> What file IOs are being served in parallel? The defragmentation IO?
> something else?

Here the file IOs means the IO requests from user-space applications, including the virtual
machine engine.

> 
>>              Though
>>              the file IOs can't truly go in parallel, they are not blocked for long. The locking time
>>              basically depends on the segment size. Smaller segments usually take less locking time,
>>              so IOs are blocked for a shorter period; bigger segments usually need more locking time,
>>              so IOs are blocked longer. Check the -s and -i options to balance defragmentation and IO
>>              service.
> 
> How is a user supposed to know what the correct values are for their
> storage, files, and workload? Algorithms should auto tune, not
> require users and administrators to use trial and error to find the
> best numbers to feed a given operation.

In my opinion, users need a way to control this according to their use case.
Any fixed algorithm will restrict what the user wants to do.
Say a user wants the defrag done as quickly as possible, regardless of the resources it takes
(CPU, IO and so on), because the production system is in a maintenance window; but when the
production system is busy, the user wants the defrag to use fewer resources.
As another example, the kernel (an algorithm) never knows the maximum IO latency the user's
applications can tolerate. But if you have such an algorithm, please share it.

We do provide default values for the options; they come from testing practice, though users
might need to change them for their own use case.

> 
>>              Temporary file
>> 
>>              A temporary file is used for the defragmentation. The temporary file is created in the
>>              same directory as the target file and is named ".xfsdefrag_<pid>". It is a sparse
>>              file and contains one defragmentation segment at a time. The temporary file is removed
>>              automatically when defragmentation finishes or is cancelled by ctrl-c. It remains if
>>              the kernel crashes while defragmentation is in progress. In that case, the temporary
>>              file has to be removed manually.
> 
> O_TMPFILE, as Darrick has already requested.

OK. Will use O_TMPFILE.
> 
>> 
>>              Free blocks consumption
>> 
>>              Defragmentation works by (trying to) allocate new (contiguous) blocks, copying data and
>>              then freeing the old (non-contiguous) blocks. Usually the number of old blocks freed
>>              equals the number of newly allocated blocks. As a final result, defragmentation doesn't
>>              consume free blocks.  Well, that is true if the target file is not sharing blocks with
>>              other files.
> 
> This is really hard to read. Defragmentation will -always- consume
> free space while it is in progress. It will always release the
> temporary space it consumes when it completes.

I don't think it always frees blocks when it releases the temporary file. When the blocks were
originally shared before defrag, those blocks won't be freed.

> 
>>              In case the target file contains shared blocks, those shared blocks won't
>>              be freed back to the filesystem as they are still owned by other files. So
>>              defragmentation allocates more blocks than it frees.
> 
> So this is doing an unshare operation as well as defrag? That seems
> ... suboptimal. The whole point of sharing blocks is to minimise
> disk usage for duplicated data.

Whether it's suboptimal depends on the user's needs: if users think defrag is the first priority,
it is worth it; if users think the disk saving is the most important thing, it is not. It doesn't
matter what developers think.
What's more, reflink (block sharing) is not only used to minimize disk usage; sometimes it's used
as a way to take snapshots, and those snapshots might not stay around for long.

And what's more, the unshare operation is what you suggested :D


> 
>>              For an existing XFS, free blocks might be over-
>>              committed when reflink snapshots were created. To avoid sending the XFS into a
>>              low-free-blocks state, this defragmentation excludes (partially) shared segments when
>>              the filesystem's free blocks reach a threshold. Check the -f option.
> 
> Again, how is the user supposed to know when they need to do this?
> If the answer is "they should always avoid defrag on low free
> space", then why is this an option?

I didn't say "they should always avoid defrag on low free space". And we can't even say how low
is intolerable for a user; that depends on the use case. Though it's an option, it has a default
value of 1GB. If users don't set this option, the behaviour is "always avoid defrag on low free space".


> 
>>              Safety and consistency
>> 
>>              The defragmentation file is guaranteed to be safe and data-consistent across ctrl-c
>>              and kernel crashes.
> 
> Which file is the "defragmentation file"? The source or the temp
> file?

I don't think there is a "source" concept here; there is no data copy between files.
"The defragmentation file" means the file under defrag; I will change it to "the file under defrag".
I don't think users care about the temporary file at all.


> 
>>              First extent share
>> 
>>              The current kernel has a routine, run for each segment defragmentation, that detects
>>              whether the file is sharing blocks.
> 
> I have no idea what this means, or what interface this refers to.
> 
>>              It takes a long time when the target file contains a huge number of extents
>>              and the shared ones, if any, are at the end. The First extent share feature works
>>              around this issue by making the first several blocks shared. Seeing that the first
>>              blocks are shared, the kernel routine ends quickly. The side effect is that the
>>              "share" flag would remain on the target file. This feature is enabled by default and
>>              can be disabled by the -n option.
> 
> And from this description, I have no idea what this is doing, what
> problem it is trying to work around, or why we'd want to share
> blocks out of a file to speed up detection of whether there are
> shared blocks in the file. This description doesn't make any sense
> to me because I don't know what interface you are actually having
> performance issues with. Please reference the kernel code that is
> problematic, and explain why the existing kernel code is problematic
> and cannot be fixed.

I mentioned the kernel function name in patch 6. It is xfs_reflink_try_clear_inode_flag().
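
To make the workaround concrete, here is a sketch of one way to establish that first shared
extent (this uses the generic FICLONERANGE ioctl; the actual patch may do it differently, and
blksz is just the filesystem block size):

    /* Sketch only: clone the first block of the target into the temp
     * file so that xfs_reflink_try_clear_inode_flag() finds a shared
     * extent immediately instead of walking the whole extent list. */
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    static int share_first_block(int target_fd, int tmp_fd, __u64 blksz)
    {
            struct file_clone_range fcr = {
                    .src_fd      = target_fd,
                    .src_offset  = 0,
                    .src_length  = blksz,
                    .dest_offset = 0,
            };

            return ioctl(tmp_fd, FICLONERANGE, &fcr);
    }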

> 
>>              extsize and cowextsize
>> 
>>              According to the kernel implementation, extsize and cowextsize can have the following
>>              impacts on defragmentation: 1) non-zero extsize causes separate block allocations for
>>              each extent in the segment, and those blocks are not contiguous.
> 
> Extent size hints do no such thing. They simply provide extent
> alignment guidelines and do not affect things like contiguous or
> multi-block allocation lengths.

Extsize really does align the number of blocks to allocate, but it affects more than that.
When extsize is set, the allocation is not a delayed allocation.
xfs_reflink_unshare() does one allocation per extent, so for a defrag segment containing
N extents there are N allocations.
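
(For anyone following along: the extent size hint discussed here is the per-inode fsx_extsize,
which can be inspected with the standard FS_IOC_FSGETXATTR ioctl; a small sketch:)

    /* Sketch only: read a file's extent size hint; 0 means no hint set. */
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    static long get_extsize_hint(int fd)
    {
            struct fsxattr fsx;

            if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
                    return -1;
            return (long)fsx.fsx_extsize;   /* bytes */
    }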

> 
>>              The segment keeps the same
>>              number of extents after defragmentation (no effect).  2) When extsize and/or cowextsize
>>              are too big, a lot of pre-allocated blocks remain in memory for a while. When new IO
>>              comes to those pre-allocated blocks, Copy on Write happens and leaves the file
>>              fragmented.
> 
> extsize based unwritten extents won't cause COW or cause
> fragmentation because they aren't shared and they are contiguous.
> I suspect that your definition of "fragmented" isn't taking into
> account that unwritten-written-unwritten over a contiguous range
> is *not* fragmentation. It's just a contiguous extent in different
> states, and this should really not be touched/changed by
> defragmentation.

Are you sure about that? In my opinion, taking buffered write as an example: during writeback,
when the target block is found in the CoW fork, Copy on Write just happens whether or not the
block is really shared. Let's look at this simple example:
1) A file contains 4 blocks. File blocks 0, 1 and 2 are shared; block 3 is not shared.
    Extsize on this file is 4 blocks.
2) A writeback comes for file blocks 0, 1 and 2.
3) On seeing that those 3 blocks are shared, the kernel pre-allocates blocks in the CoW fork.
    Extsize being 4 blocks, after alignment 4 blocks (unwritten) are allocated in the CoW fork.
4) Data is written to 3 of the blocks in the CoW fork. In the IO-done callback, those 3 blocks in
    the CoW fork are moved to the data fork, and the original 3 blocks in the data fork are freed.

The Copy on Write is done, right?
But remember, there is 1 unwritten block left in the CoW fork.
If a new writeback now comes for file block 3, the kernel sees there is a file block 3 in the CoW
fork, and a new Copy on Write happens.

> 
> check out xfs_fsr: it ensures that the pattern of unwritten/written
> blocks in the defragmented file is identical to the source. i.e. it
> preserves preallocation because the application/fs config wants it
> to be there....
> 
>>              Readahead
>> 
>>              Readahead tries to fetch the data blocks for the next segment, with less locking, in
>>              the background during idle time. This feature is disabled by default; use -a to enable it.
> 
> What are you reading ahead into? Kernel page cache or user buffers?

Kernel page cache.

> Either way, it's hardly what I'd call "idle time" if the defrag
> process is using it to issue lots of read IO...
> 

During the "idle time", the file is not locked (no IOLOCK held), though disk fetching might be happening.
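
The readahead itself amounts to no more than hinting the next segment into the page cache
while no lock is held; a sketch (not the patch code):

    /* Sketch only: POSIX_FADV_WILLNEED starts readahead into the page
     * cache and returns without waiting for the IO to complete. */
    #include <fcntl.h>

    static void prefetch_next_segment(int fd, off_t offset, off_t length)
    {
            posix_fadvise(fd, offset, length, POSIX_FADV_WILLNEED);
    }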

> 
>>              The command takes the following options:
>>                 -f free_space
>>                     The threshold of XFS free blocks in MiB. When free blocks are less than this
>>                     number, (partially) shared segments are excluded from defragmentation. The
>>                     default is 1024.
> 
> When you are down to 4MB of free space in the filesystem, you
> shouldn't even be trying to run defrag because all the free space
> that will be left in the filesystem is single blocks. I would have
> expected this sort of number to be in a percentage of capacity,
> defaulting to something like 5% (which is where we start running low
> space algorithms in the kernel).

I would like to leave this to the user. When a user runs defrag on a system low on free space, it
won't cause problems for the filesystem itself; at most the defrag fails during unshare when
allocating blocks. You can't prevent users from writing new files when the system is low on free
space either.

I don't think a percentage is a good idea. Say, for a 10TiB filesystem, 5% is 512GiB, which is
plenty to work with; for a small one, say a 512MiB filesystem, 5% is about 25MiB, which is too
little. In the above cases, limiting by a percentage would either prevent users from doing
something that can be done without any problem, or allow users to do something that might cause
a problem. I think specifying a fixed safe size is better.
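
The check itself is cheap; roughly (a sketch, assuming the -f value is in MiB as documented):

    /* Sketch only: skip (partially) shared segments when filesystem
     * free space is below the -f threshold. */
    #include <sys/statvfs.h>

    static int below_free_threshold(const char *mntpt, unsigned long long min_mib)
    {
            struct statvfs sv;

            if (statvfs(mntpt, &sv) < 0)
                    return -1;
            return ((unsigned long long)sv.f_bfree * sv.f_frsize) < (min_mib << 20);
    }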


> 
>>                 -i idle_time
>>                     The time in milliseconds for which defragmentation enters an idle state after
>>                     defragmenting a segment and before handling the next. Default number is TOBEDONE.
> 
> Yeah, I don't think this is something anyone would be expected to
> use or tune. If an idle time is needed, the defrag application
> should be selecting this itself.

I don't think so; see my explanation above.

>> 
>>                 -s segment_size
>>                     The size limit of a segment, in bytes. The minimum is 4MiB and the default
>>                     is 16MiB.
> 
> Why were these numbers chosen? What happens if the file has ~32MB
> sized extents and the user wants the file to be returned to a single
> large contiguous extent if possible? i.e. how is the user supposed
> to know how to set this for any given file without first having
> examined the exact pattern of fragmentation in the file?

Why would a customer want the file returned to a single large contiguous extent?
A 32MB extent is pretty good to me; I haven't heard any customer complain about 32MB extents.
And, as you know, whether we can defrag extents into one large extent depends on more than the
tool itself: it also depends on the state of the filesystem, e.g. whether the filesystem itself is
badly fragmented, and on the AG size.

The 16MB default was selected from our tests based on a customer metadump. With a 16MB
segment size, the defrag result is very good and the IO latency is acceptable too. With the
default 16MB segment size, 32MB extents are excluded from defrag.

If you have a better default size, we can use that.

> 
>>                 -n  Disable the First extent share feature. Enabled by default.
> 
> So confusing.  Is the "feature disable flag" enabled by default, or
> is the feature enabled by default?

Will change it to the following if that's clearer:
The "First extent share" feature is enabled by default. Use -n to disable it.

> 
>>                 -a  Enable readahead feature, disabled by default.
> 
> Same confusion, but opposite logic.
> 
> I would highly recommend that you get a native English speaker to
> review, spell and grammar check the documentation before the next
> time you post it.

OK, will try to do so.

> 
>> We tested with a real customer metadump with several different 'idle_time's and found 250ms to
>> be a good practical sleep time. Here are some numbers from the test:
>> 
>> Test: run defrag on an image file which is used as the back end of a block device in a
>>      virtual machine. At the same time, fio is running inside the virtual machine on that
>>      block device.
>> block device type:   NVME
>> File size:           200GiB
>> parameters to defrag: free_space: 1024 idle_time: 250 First_extent_share: enabled readahead: disabled
>> Defrag run time:     223 minutes
>> Number of extents:   6745489(before) -> 203571(after)
> 
> So an average extent size of ~32kB before, 100MB after? How many of
> these are shared extents?

Zero shared extents, but there are some unwritten ones.
The stats from a similar run look like this:
Pre-defrag 6654460 extents detected, 112228 are "unwritten", 0 are "shared"
Tried to defragment 6393352 extents (181000359936 bytes) in 26032 segments
Time stats(ms): max clone: 31, max unshare: 300, max punch_hole: 66
Post-defrag 282659 extents detected

> 
> Runtime is 13380secs, so if we copied 200GiB in that time, the
> defrag ran at 16MB/s. That's not very fast.
> 

We are chasing a balance between defrag speed and parallel IO latency.

> What's the CPU utilisation of the defrag task and kernel side
> processing? What is the difference between "first_extent_share"
> enabled and disabled (both performance numbers and CPU usage)?

On my test VM (spindle-based disk, I think), CPU usage is about 6% for the defrag command; the
kernel-side processing is much lower. I didn't pay much attention to the CPU usage when
"first_extent_share" was disabled, but I think that caused very high CPU usage.

> 
>> Fio read latency:    15.72ms(without defrag) -> 14.53ms(during defrag)
>> Fio write latency:   32.21ms(without defrag) -> 20.03ms(during defrag)
> 
> So the IO latency is *lower* when defrag is running? That doesn't
> make any sense, unless the fio throughput is massively reduced while
> defrag is running.  

That's reasonable: for the segments where defrag is already done, the page cache remains
populated, so reads hit the cache.

> What's the throughput change in the fio
> workload? What's the change in worst case latency for the fio
> workload? i.e. post the actual fio results so we can see the whole
> picture of the behaviour, not just a single cherry-picked number.

Let me see if we have that saved.

> 
> Really, though, I have to ask: why is this an xfs_spaceman command
> and not something built into the existing online defrag program
> we have (xfs_fsr)?
> 

Quotation from the previous conversation:
"""""
> Let me know if you want it in xfsprog.

Yes, i think adding it as an xfs_spaceman command would be a good
way for this defrag feature to be maintained for anyone who has need
for it.

-Dave.
"""""

Thanks,
Wengang

> I'm sure I'll have more questions as I go through the code - I'll
> start at the userspace IO engine part of the patchset so I have some
> idea of what the defrag algorithm actually is...
> 
> -Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx




