This is how xfs_fsr works - it tries to preallocate all the space it
will need before it starts moving data. If it fails to preallocate
all the space, it aborts. If it fails to find large enough contiguous
free spaces to improve the layout of the file, it aborts. IOWs, the
xfs_fsr policy is that it doesn't care about the amount of free space
in the filesystem, it just cares whether the result will improve the
layout of the file.

That's basically how any online background defrag operation should
work - if the new layout is worse than the existing layout, or there
isn't space for the new layout to be allocated, just abort.

> >> Safety and consistency
> >>
> >> The defragmentation file is guaranteed safe and data consistent
> >> across ctrl-c and kernel crash.
> >
> > Which file is the "defragmentation file"? The source or the temp
> > file?
>
> I don't think there is a "source" concept here. There is no data
> copy between files. "The defragmentation file" means the file under
> defrag, I will change it to "the file under defrag". I don't think
> users care about the temporary file at all.

Define the terms you use rather than assuming the reader understands
both the terminology you are using and the context in which you are
using them.

.....

> >> The command takes the following options:
> >>
> >> -f free_space
> >> The threshold of XFS free blocks in MiB. When free blocks are less
> >> than this number, (partially) shared segments are excluded from
> >> defragmentation. Default number is 1024.
> >
> > When you are down to 4MB of free space in the filesystem, you
> > shouldn't even be trying to run defrag because all the free space
> > that will be left in the filesystem is single blocks. I would have
> > expected this sort of number to be in a percentage of capacity,
> > defaulting to something like 5% (which is where we start running
> > low space algorithms in the kernel).
>
> I would like to leave this to the user.

Again: how is the user going to know what to set this to? What
problem is this avoiding that requires the user to change this in any
way?

> When the user is doing defrag on a low free space system, it won't
> cause problems for the filesystem itself. At most the defrag fails
> during unshare when allocating blocks.

Why would we even allow a user to run defrag near ENOSPC? It is a
well known problem that finding contiguous free space when we are
close to ENOSPC is difficult, and so defrag is often unable to
improve the situation when we are within a few percent of the
filesystem being full.

It is also a well known problem that defragmentation at low free
space trades off contiguous free space for fragmented free space.
Hence when we are at low free space, defrag makes the free space
fragmentation worse, which then results in all allocation in the
filesystem getting worse and more fragmented. This is something we
absolutely should be trying to avoid.

This is one of the reasons xfs_fsr tries to lay out the entire file
before doing any IO - when the filesystem is about 95% full, it's
common for the new layout to be worse than the original file's layout
because there isn't sufficient contiguous free space to improve the
layout. IOWs, running defragmentation when we are above 95% full is
actively harmful to the longevity of the filesystem.

Hence, on a fundamental level, having a low space threshold in a
defragmentation tool is simply wrong - defragmentation should simply
not be run when the filesystem is anywhere near full.
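i.e. all the tool should need is a hard cut-off, something along
these lines (a completely untested sketch - the helper name is made
up for illustration, and the 5% number just mirrors the kernel's low
space heuristic, so pick whatever the final policy ends up being):

#include <sys/statfs.h>
#include <stdbool.h>

/*
 * Untested sketch: refuse to run defrag when the filesystem is
 * nearly full. The 5% cutoff mirrors where the kernel starts
 * running its low space allocation algorithms; it is a policy
 * default, not a tuned number.
 */
static bool
defrag_free_space_ok(const char *mntpt)
{
	struct statfs	sfs;

	if (statfs(mntpt, &sfs) < 0)
		return false;

	/* refuse to run when more than ~95% of the blocks are used */
	return (unsigned long long)sfs.f_bfree * 100 >=
	       (unsigned long long)sfs.f_blocks * 5;
}

No knob for the user to get wrong, no trading off free space
fragmentation against file fragmentation near ENOSPC.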
.....

> >> -s segment_size
> >> The size limitation in bytes of segments. Minimum number is 4MiB,
> >> default number is 16MiB.
> >
> > Why were these numbers chosen? What happens if the file has ~32MB
> > sized extents and the user wants the file to be returned to a
> > single large contiguous extent, if possible? i.e. how is the user
> > supposed to know how to set this for any given file without first
> > having examined the exact pattern of fragmentation in the file?
>
> Why would a customer want the file to be returned to a single large
> contiguous extent? A 32MB extent is pretty good to me. I didn't hear
> any customer complain about 32MB extents...

There's a much wider world out there than just Oracle customers. Just
because you aren't aware of other use cases that exist, it doesn't
mean they don't exist. I know they exist, hence my question.

For example, extent size hints are used to guarantee that the data is
aligned to the underlying storage correctly, and very large
contiguous extents are required to avoid excessive seeks during
sequential reads that result in critical SLA failures. Hence if a
file is poorly laid out in this situation, defrag needs to return it
to as few, maximally sized extents as it can. How does a user know
what they'd need to set this segment size field to and so achieve the
result they need?

> And you know, whether we can defrag extents into a large one depends
> not only on the tool itself. It depends on the status of the
> filesystem too, say if the filesystem is very fragmented, or on the
> AG size...
>
> The 16MB was selected according to our tests based on a customer
> metadump. With a 16MB segment size, the defrag result is very good
> and the IO latency is acceptable too. With the default 16MB segment
> size, 32MB extents are excluded from defrag.

Exactly my point: you have written a solution that works for a single
filesystem in a single environment. However, the solution is so
specific to the single problem you need to solve that it is not clear
whether that functionality or the defaults are valid outside of the
specific problem case you've written it for and tested it on.

> If you have a better default size, we can use that.

I'm not convinced that fixed size "segments" is even the right way to
approach this problem. What needs to be done is dependent on the
extent layout of the file, not how extents fit over some arbitrary
fixed segment map....
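i.e. the tool should be reading the extent map and deciding what to
rewrite from the layout it actually finds. Roughly this sort of thing
(untested sketch using FIEMAP to walk the mapping; the selection
policy in the loop is just a placeholder, and error handling is
trimmed):

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

#define NR_EXTENTS	64

/*
 * Untested sketch: walk the file's extent map and decide what needs
 * defragmenting from the actual layout, rather than from a fixed
 * segment size.
 */
static int
walk_extents(int fd)
{
	char			buf[sizeof(struct fiemap) +
				    NR_EXTENTS * sizeof(struct fiemap_extent)];
	struct fiemap		*fm = (struct fiemap *)buf;
	uint64_t		next = 0;

	while (1) {
		memset(buf, 0, sizeof(buf));
		fm->fm_start = next;
		fm->fm_length = FIEMAP_MAX_OFFSET - next;
		fm->fm_extent_count = NR_EXTENTS;

		if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
			return -1;
		if (!fm->fm_mapped_extents)
			return 0;

		for (uint32_t i = 0; i < fm->fm_mapped_extents; i++) {
			struct fiemap_extent	*fe = &fm->fm_extents[i];

			/*
			 * Selection policy goes here - e.g. runs of
			 * small extents are candidates for rewriting,
			 * large contiguous extents are left alone.
			 */

			next = fe->fe_logical + fe->fe_length;
			if (fe->fe_flags & FIEMAP_EXTENT_LAST)
				return 0;
		}
	}
}

The point being: the extent records tell you what is worth doing for
any given file, so there's no arbitrary segment size for the user to
tune.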
> >> We tested with a real customer metadump with some different
> >> 'idle_time's and found 250ms is a good practical sleep time. Here
> >> are some numbers from the test:
> >>
> >> Test: running defrag on the image file which is used as the back
> >> end of a block device in a virtual machine. At the same time, fio
> >> is running inside the virtual machine on that block device.
> >> block device type: NVME
> >> File size: 200GiB
> >> parameters to defrag: free_space: 1024 idle_time: 250
> >>     First_extent_share: enabled readahead: disabled
> >> Defrag run time: 223 minutes
> >> Number of extents: 6745489 (before) -> 203571 (after)
> >
> > So an average extent size of ~32kB before, 100MB after? How much of
> > these are shared extents?
>
> Zero shared extents, but there are some unwritten ones.
> Stats from a similar run look like this:
>
> Pre-defrag 6654460 extents detected, 112228 are "unwritten", 0 are "shared"
> Tried to defragment 6393352 extents (181000359936 bytes) in 26032 segments
> Time stats(ms): max clone: 31, max unshare: 300, max punch_hole: 66
> Post-defrag 282659 extents detected
>
> > Runtime is 13380secs, so if we copied 200GiB in that time, the
> > defrag ran at 16MB/s. That's not very fast.
>
> We are chasing the balance of defrag and parallel IO latency.

My point is that stuff like CLONE and UNSHARE should be able to run
much, much faster than this, even if some of the time is left idle
for other IO.

i.e. we can clone extents at about 100,000/s. We can copy data
through the page cache at 7-8GB/s on NVMe devices. A full clone of
the 6.6 million extents should only take about a minute. A full page
cache copy of the 200GB cloned file (i.e. via read/write syscalls)
should easily run at >1GB/s, and so only take a couple of minutes to
run.

IOWs, the actual IO and metadata modification side of things is
really only about 5 minutes worth of CPU and IO. Hence this defrag
operation is roughly 100x slower than we should be able to run it at.
We should be able to run it at close to those speeds whilst still
allowing concurrent read access to the file. If an admin then wants
it to run at 16MB/s, it can be throttled to that speed using cgroups,
ionice, etc.

i.e. I think you are trying to solve too many unnecessary problems
here and not addressing the one thing it should do: defrag a file as
fast and efficiently as possible.

-Dave.

--
Dave Chinner
david@xxxxxxxxxxxxx