[ Please keep documentation text to 80 columns. ]

[ Please run documentation through a spell checker - there are too many
  typos in this document to point them all out... ]

On Tue, Jul 09, 2024 at 12:10:19PM -0700, Wengang Wang wrote:
> This patch set introduces defrag to the xfs_spaceman command. It has
> the functionality and features below (also subject to being added to
> the man page, so please review):

What's the use case for this?

> defrag [-f free_space] [-i idle_time] [-s segment_size] [-n] [-a]
>
> defrag defragments the specified XFS file online, non-exclusively.

What's "non-exclusively" mean? How is this different to what xfs_fsr
does?

> The target XFS doesn't have to (and must not) be unmounted. When
> defragmentation is in progress, file IOs are served 'in parallel'.
> The reflink feature must be enabled in the XFS filesystem.

xfs_fsr allows IO to occur in parallel to defrag.

> Defragmentation and file IOs
>
> The target file is virtually divided into many small segments.
> Segments are the smallest units for defragmentation. Each segment is
> defragmented one by one in a lock->defragment->unlock->idle manner.

Userspace can't easily lock the file to prevent concurrent access, so
I'm not sure what you are referring to here.

> File IOs are blocked when the target file is locked and are served
> during the defragmentation idle time (file is unlocked).

What file IOs are being served in parallel? The defragmentation IO?
Something else?

> Though the file IOs can't really go in parallel, they are not blocked
> for long. The locking time basically depends on the segment size.
> Smaller segments usually take less locking time, so IOs are blocked
> for a shorter time; bigger segments usually need more locking time,
> so IOs are blocked longer. Check the -s and -i options to balance
> defragmentation and IO service.

How is a user supposed to know what the correct values are for their
storage, files, and workload?

Algorithms should auto tune, not require users and administrators to
use trial and error to find the best numbers to feed a given operation.

> Temporary file
>
> A temporary file is used for the defragmentation. The temporary file
> is created in the same directory as the target file and is named
> ".xfsdefrag_<pid>". It is a sparse file and contains one
> defragmentation segment at a time. The temporary file is removed
> automatically when defragmentation is done or is cancelled by ctrl-c.
> It remains if the kernel crashes while defragmentation is in
> progress. In that case, the temporary file has to be removed
> manually.

O_TMPFILE, as Darrick has already requested.

> Free blocks consumption
>
> Defragmentation works by allocating new (contiguous) blocks, copying
> the data, and then freeing the old (non-contiguous) blocks. Usually
> the number of old blocks to free equals the number of newly allocated
> blocks. As a final result, defragmentation doesn't consume free
> blocks. Well, that is true if the target file is not sharing blocks
> with other files.

This is really hard to read. Defragmentation will -always- consume free
space while it is in progress. It will always release the temporary
space it consumes when it completes.

> In case the target file contains shared blocks, those shared blocks
> won't be freed back to the filesystem as they are still owned by
> other files. So defragmentation allocates more blocks than it frees.

So this is doing an unshare operation as well as defrag? That seems ...
suboptimal. The whole point of sharing blocks is to minimise disk usage
for duplicated data.
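To make the O_TMPFILE suggestion above concrete, here's a minimal
sketch of what it could look like (my code, not the patchset's; error
handling and the fallback for kernels/filesystems without O_TMPFILE
support are omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <libgen.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sketch only: create an unnamed temporary file in the same
 * directory as the target, so no ".xfsdefrag_<pid>" name is ever
 * visible in the namespace.
 */
static int open_defrag_tmpfile(const char *target_path)
{
	char *dir = strdup(target_path);	/* dirname() modifies it */
	int fd;

	if (!dir)
		return -1;
	fd = open(dirname(dir), O_TMPFILE | O_RDWR, 0600);
	free(dir);
	return fd;
}

Because the inode is never linked into the namespace, there's nothing
to race against and nothing for an admin to remove by hand after a
crash - the orphaned inode should be reclaimed by log recovery.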
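On the shared blocks point: userspace can at least observe which
extents in a range are shared via FIEMAP before deciding what to do
with them. A sketch (again mine) that only inspects the first batch of
extents - a real version would loop over the whole range:

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

/*
 * Returns 1 if any of the first 32 mapped extents in
 * [start, start + len) carries FIEMAP_EXTENT_SHARED, 0 if none,
 * -1 on error.
 */
static int segment_has_shared_extents(int fd, __u64 start, __u64 len)
{
	unsigned int count = 32, i;
	struct fiemap *fm;
	int ret = 0;

	fm = calloc(1, sizeof(*fm) + count * sizeof(struct fiemap_extent));
	if (!fm)
		return -1;

	fm->fm_start = start;
	fm->fm_length = len;
	fm->fm_extent_count = count;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
		ret = -1;
	else
		for (i = 0; i < fm->fm_mapped_extents; i++)
			if (fm->fm_extents[i].fe_flags & FIEMAP_EXTENT_SHARED)
				ret = 1;

	free(fm);
	return ret;
}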
> For existing XFS filesystems, free blocks might be over-committed
> when reflink snapshots were created. To avoid running the filesystem
> into a low free blocks state, this defragmentation excludes
> (partially) shared segments when the filesystem free blocks reach a
> threshold. Check the -f option.

Again, how is the user supposed to know when they need to do this? If
the answer is "they should always avoid defrag on low free space", then
why is this an option?

> Safety and consistency
>
> The defragmentation file is guaranteed safe and data consistent
> across ctrl-c and kernel crash.

Which file is the "defragmentation file"? The source or the temp file?

> First extent share
>
> The current kernel has a routine, run for each segment
> defragmentation, that detects whether the file is sharing blocks.

I have no idea what this means, or what interface this refers to.

> It takes a long time when the target file contains a huge number of
> extents and the shared ones, if there are any, are at the end. The
> First extent share feature works around the above issue by making the
> first several blocks shared. Seeing that the first blocks are shared,
> the kernel routine ends quickly. The side effect is that the "share"
> flag would remain on the target file. This feature is enabled by
> default and can be disabled by the -n option.

And from this description, I have no idea what this is doing, what
problem it is trying to work around, or why we'd want to share blocks
out of a file to speed up detection of whether there are shared blocks
in the file.

This description doesn't make any sense to me because I don't know what
interface you are actually having performance issues with. Please
reference the kernel code that is problematic, and explain why the
existing kernel code is problematic and cannot be fixed.

> extsize and cowextsize
>
> According to the kernel implementation, extsize and cowextsize could
> have the following impacts on defragmentation: 1) non-zero extsize
> causes separate block allocations for each extent in the segment and
> those blocks are not contiguous.

Extent size hints do no such thing. They simply provide extent
alignment guidelines and do not affect things like contiguous or
multi-block allocation lengths.

> The segment retains the same number of extents after defragmentation
> (no effect). 2) When extsize and/or cowextsize are too big, a lot of
> pre-allocated blocks remain in memory for a while. When new IO comes
> to those pre-allocated blocks, Copy on Write happens and causes the
> file to become fragmented.

extsize based unwritten extents won't cause COW or cause fragmentation
because they aren't shared and they are contiguous.

I suspect that your definition of "fragmented" isn't taking into
account that unwritten-written-unwritten over a contiguous range is
*not* fragmentation. It's just a contiguous extent in different states,
and this should really not be touched/changed by defragmentation.

Check out xfs_fsr: it ensures that the pattern of unwritten/written
blocks in the defragmented file is identical to the source. i.e. it
preserves preallocation because the application/fs config wants it to
be there....

> Readahead
>
> Readahead tries to fetch the data blocks for the next segment, with
> less locking, in the background during idle time. This feature is
> disabled by default; use -a to enable it.

What are you reading ahead into? Kernel page cache or user buffers?
Either way, it's hardly what I'd call "idle time" if the defrag process
is using it to issue lots of read IO...
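If the readahead is going into the kernel page cache, it shouldn't
need any of the segment locking at all - an advisory hint is enough. A
sketch, assuming page cache readahead is what's intended:

#include <fcntl.h>

/*
 * Hint the kernel to start reading the next segment into the page
 * cache. This is advisory and returns once the readahead has been
 * initiated, so it needs no coordination with the segment locking
 * described above.
 */
static int prefetch_segment(int fd, off_t offset, off_t length)
{
	return posix_fadvise(fd, offset, length, POSIX_FADV_WILLNEED);
}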
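Coming back to the "First extent share" description above: the only
userspace interface I know of for deliberately sharing a file range is
FICLONERANGE. If that is what the patch is doing, it would be
something like this guess of mine:

#include <sys/ioctl.h>
#include <linux/fs.h>

/*
 * Guesswork, not the patchset's code: clone the first 'len' bytes
 * of src_fd into tmp_fd so that the source file becomes a file
 * that "contains shared blocks".
 */
static int share_first_blocks(int src_fd, int tmp_fd, __u64 len)
{
	struct file_clone_range fcr = {
		.src_fd		= src_fd,
		.src_offset	= 0,
		.src_length	= len,
		.dest_offset	= 0,
	};

	return ioctl(tmp_fd, FICLONERANGE, &fcr);
}

If so, that permanently reflink-flags the target file just to
short-circuit some in-kernel scan, which is exactly the kind of side
effect that needs to be spelled out with a reference to the kernel
code involved.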
> The command takes the following options:
>
> -f free_space
>      The threshold of XFS free blocks in MiB. When free blocks are
>      less than this number, (partially) shared segments are excluded
>      from defragmentation. Default number is 1024.

When you are down to 4MB of free space in the filesystem, you shouldn't
even be trying to run defrag because all the free space that will be
left in the filesystem is single blocks.

I would have expected this sort of number to be a percentage of
capacity, defaulting to something like 5% (which is where we start
running low space algorithms in the kernel).

> -i idle_time
>      The time in milliseconds that defragmentation stays idle after
>      defragmenting a segment and before handling the next. Default
>      number is TOBEDONE.

Yeah, I don't think this is something anyone should be expected to use
or tune. If an idle time is needed, the defrag application should be
selecting it itself.

> -s segment_size
>      The size limit in bytes of segments. Minimum number is 4MiB,
>      default number is 16MiB.

Why were these numbers chosen? What happens if the file has ~32MB sized
extents and the user wants the file to be returned to a single large
contiguous extent if possible?

i.e. how is the user supposed to know how to set this for any given
file without first having examined the exact pattern of fragmentation
in the file?

> -n   Disable the First extent share feature. Enabled by default.

So confusing. Is the "feature disable flag" enabled by default, or is
the feature enabled by default?

> -a   Enable the readahead feature, disabled by default.

Same confusion, but opposite logic.

I would highly recommend that you get a native english speaker to
review, spell and grammar check the documentation before the next time
you post it.

> We tested with a real customer metadump with several different
> 'idle_time's and found 250ms to be a good sleep time in practice.
> Here are some numbers from the test:
>
> Test: running defrag on an image file used as the back end of a
> block device in a virtual machine, while fio runs inside the virtual
> machine on that block device.
>
> block device type: NVMe
> File size: 200GiB
> parameters to defrag: free_space: 1024, idle_time: 250,
>      First_extent_share: enabled, readahead: disabled
> Defrag run time: 223 minutes
> Number of extents: 6745489 (before) -> 203571 (after)

So an average extent size of ~32kB before, ~1MB after? How much of
these are shared extents?

Runtime is 13380 secs, so if we copied 200GiB in that time, the defrag
ran at ~16MB/s. That's not very fast. What's the CPU utilisation of the
defrag task and kernel side processing?

What is the difference between "first_extent_share" enabled and
disabled (both performance numbers and CPU usage)?

> Fio read latency: 15.72ms (without defrag) -> 14.53ms (during defrag)
> Fio write latency: 32.21ms (without defrag) -> 20.03ms (during
> defrag)

So the IO latency is *lower* when defrag is running? That doesn't make
any sense, unless the fio throughput is massively reduced while defrag
is running.

What's the throughput change in the fio workload? What's the change in
worst case latency for the fio workload? i.e. post the actual fio
results so we can see the whole picture of the behaviour, not just a
single cherry-picked number.

Really, though, I have to ask: why is this an xfs_spaceman command and
not something built into the existing online defrag program we have
(xfs_fsr)?
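For context on that last question, the core of xfs_fsr's mechanism is
small: copy the data into a temporary file, then atomically swap the
extent maps of the two inodes. Stripped right down (my sketch against
the xfsprogs headers; the real code also preserves unwritten extents
and validates much more):

#include <sys/types.h>
#include <sys/ioctl.h>
#include <string.h>
#include <xfs/xfs.h>	/* XFS_IOC_SWAPEXT, struct xfs_swapext */

/*
 * After tmp_fd has been populated with a defragmented copy of the
 * data, swap the extent maps of the two files atomically. sx_stat
 * must come from a bulkstat of the target taken before the copy
 * started, so the kernel can reject the swap if the target changed
 * underneath us.
 */
static int swap_into_place(int target_fd, int tmp_fd, off_t size,
			   struct xfs_bstat *target_stat)
{
	struct xfs_swapext sx;

	memset(&sx, 0, sizeof(sx));
	sx.sx_version	= XFS_SX_VERSION;
	sx.sx_fdtarget	= target_fd;
	sx.sx_fdtmp	= tmp_fd;
	sx.sx_offset	= 0;
	sx.sx_length	= size;
	sx.sx_stat	= *target_stat;

	return ioctl(target_fd, XFS_IOC_SWAPEXT, &sx);
}

Because the kernel refuses the swap if the target was modified after
the bulkstat, fsr is safe against concurrent IO without any userspace
"locking" of the file.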
I'm sure I'll have more questions as I go through the code - I'll start
at the userspace IO engine part of the patchset so I have some idea of
what the defrag algorithm actually is...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx