Re: [PATCH 0/9] introduce defrag to xfs_spaceman

On Tue, Jul 16, 2024 at 07:45:37PM +0000, Wengang Wang wrote:
> 
> 
> > On Jul 15, 2024, at 4:03 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > 
> > [ Please keep documentation text to 80 columns. ] 
> > 
> 
> Yes. This is not a patch. I copied it from the man 8 output.
> It will be limited to 80 columns when sent as a patch.
> 
> > [ Please run documentation through a spell checker - there are too
> > many typos in this document to point them all out... ]
> 
> OK.
> 
> > 
> > On Tue, Jul 09, 2024 at 12:10:19PM -0700, Wengang Wang wrote:
> >> This patch set introduces defrag to the xfs_spaceman command. It has the functionality and
> >> features below (also to be added to the man page, so please review):
> > 
> > What's the use case for this?
> 
> This is the user space defrag as you suggested previously.
> 
> Please see the previous conversation for your reference: 
> https://patchwork.kernel.org/project/xfs/cover/20231214170530.8664-1-wen.gang.wang@xxxxxxxxxx/

That's exactly what you should have put in the cover letter!

The cover letter is not for documenting the user interface of a new
tool - that's what the patch in the patch set for the new man page
should be doing.

The cover letter should contain references to past patch sets and
discussions on the topic. The cover letter should also contain a
changelog that documents what is different in this new version of
the patch set so reviewers know what you've changed since they last
looked at it.

IOWs, the cover letter is for explaining the use case, why the
functionality is needed, important design/implementation decisions
and the history of the patchset. It's meant to inform and remind
readers of what has already happened to get to this point.

> COPY STARTS —————————————> 
> I am copying your last comment there:
> 
> On Tue, Dec 19, 2023 at 09:17:31PM +0000, Wengang Wang wrote:
> > Hi Dave,
> > Yes, the user space defrag works and satisfies my requirement (almost no change from your example code).
> 
> That's good to know :)
> 
> > Let me know if you want it in xfsprogs.
> 
> Yes, I think adding it as an xfs_spaceman command would be a good
> way for this defrag feature to be maintained for anyone who has need
> for it.

Sure, I might have said that 6 months ago. When presented with a
completely new implementation in a new context months later, I might
see things differently.  Everyone is allowed to change their mind,
opinions and theories as circumstances, evidence and contexts
change.

Indeed, when I look at this:

> >>       defrag [-f free_space] [-i idle_time] [-s segment_size] [-n] [-a]
> >>              defrag defragments the specified XFS file online non-exclusively. The target XFS

I didn't expect anything nearly as complex and baroque as this. All
I was expecting was something like this to defrag a single range of
a file:

	xfs_spaceman -c "defrag <offset> <length>" <file>

That would be the control command, and the functionality for
automated/periodic scanning and defrag would still end up being
co-ordinated by the existing xfs_fsr code.

> > What's "non-exclusively" mean? How is this different to what xfs_fsr
> > does?
> > 
> 
> I think you will have seen the difference when reviewing more of this set.
> Well, if I read the xfs_fsr code correctly, though xfs_fsr allows parallel writes, it looks to
> have a problem(?). As I read the code, xfs_fsr does the following to defrag one file:
> 1) preallocate blocks to a temporary file, hoping the temporary file gets the same number of
>     blocks as the file under defrag with fewer extents.
> 2) copy data blocks from the file under defrag to the temporary file.
> 3) switch the extents between the two files.
> 
> For stage 2, it's NOT copying data blocks in an atomic manner. Take an example where two
> read->write pairs are needed to complete the data copy, that is:
>     Copy range 1 (read range 1 from the file under defrag to the temporary file)
>     Copy range 2

I wasn't asking you to explain to me how the xfs_fsr algorithm
works. What I was asking for was a definition of what
"non-exclusively" means.

What xfs_fsr currently does meets my definition of "non-exclusive" -
it does not rely on or require exclusive access to the file being
defragmented except for the atomic extent swap at the end. However,
using FICLONE/UNSHARE does require exclusive access to the file
being defragmented for the entirety of those operations, so I don't
have any real idea of why this new algorithm is explicitly described
as "non-exclusive".

Defining terms so everyone has a common understanding is important.

Indeed, given that we now have XFS_IOC_EXCHANGE_RANGE, I'm
definitely starting to wonder if clone/unshare is actually the best
way to do this now.  I think we could make xfs_fsr do iterative
small file region defrag using XFS_IOC_EXCHANGE_RANGE instead of
'whole file at once' as it does now. If we were also to make fsr
aware of shared extents, it could leave that sharing intact while it
defragments the rest of the file.
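
Something like the following per-region loop body, perhaps. This is
only a sketch: the struct and flag names are taken from the
exchange-range uapi as merged in 6.10, so check them against your
xfs_fs.h, and all error handling and "is this region fragmented
enough to bother?" policy is omitted.

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <xfs/xfs_fs.h>		/* XFS_IOC_EXCHANGE_RANGE */

	/*
	 * Exchange one defragmented region from a donor file back
	 * into the target. Assumes the caller has already written a
	 * defragmented copy of the region into tmpfd at the same
	 * offset (e.g. with copy_file_range()).
	 */
	static int exchange_region(int targetfd, int tmpfd,
				   uint64_t offset, uint64_t length)
	{
		struct xfs_exchange_range xchg = {
			.file1_fd	= tmpfd,	/* donor file */
			.file1_offset	= offset,
			.file2_offset	= offset,
			.length		= length,
			.flags		= 0,
		};

		/* The ioctl is issued on the target (file2) descriptor
		 * and swaps the two files' extents for this region
		 * atomically. */
		return ioctl(targetfd, XFS_IOC_EXCHANGE_RANGE, &xchg);
	}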

> > 
> >>              Defragmentation and file IOs
> >> 
> >>              The target file is virtually divided into many small segments. Segments are the
> >>              smallest units for defragmentation. Each segment is defragmented one by one in a
> >>              lock->defragment->unlock->idle manner.
> > 
> > Userspace can't easily lock the file to prevent concurrent access.
> > So I'm not sure what you are referring to here.
> 
> The manner is not simply about what is done in user space, but the whole thing across both user
> space and kernel space.  The tool defrags the file segment by segment. The lock->defragment->unlock
> is done by the kernel in response to the FALLOC_FL_UNSHARE_RANGE request from user space.

I'm still not sure what locking you are trying to describe. There
are multiple layers of locking in the kernel, and we use them
differently. Indeed, the algorithm you have described is actually:

	FICLONERANGE
	IOLOCK shared
	ILOCK exclusive
	remap_file_range()
	IUNLOCK exclusive
	IOUNLOCK shared

	.....

	UNSHARE_RANGE
	IOLOCK exclusive
	MMAPLOCK exclusive
	<drain DIO in flight>
	ILOCK exclusive
	unshare_range()
	IUNLOCK exclusive
	MMAPUNLOCK exclusive
	IOUNLOCK exclusive

And so there isn't a single "lock -> defrag -> unlock" context
occurring - there are multiple independent operations that have
different kernel side locking contexts and there are no userspace
side file locking contexts, either.
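
For reference, the userspace side of those two operations is just a
clone followed by an unshare - a sketch, with error handling omitted
and the function name invented for illustration:

	#define _GNU_SOURCE
	#include <fcntl.h>		/* fallocate() */
	#include <linux/falloc.h>	/* FALLOC_FL_UNSHARE_RANGE */
	#include <linux/fs.h>		/* FICLONERANGE, file_clone_range */
	#include <sys/ioctl.h>

	/* Share one segment of the source with the temp file, then
	 * unshare that segment in the source so it is rewritten into
	 * new, exclusively owned (and hopefully contiguous) blocks. */
	static int defrag_segment(int srcfd, int tmpfd, off_t off,
				  off_t len)
	{
		struct file_clone_range fcr = {
			.src_fd		= srcfd,
			.src_offset	= off,
			.src_length	= len,
			.dest_offset	= off,
		};

		/* Triggers the FICLONERANGE locking sequence above. */
		if (ioctl(tmpfd, FICLONERANGE, &fcr) < 0)
			return -1;

		/* Triggers the UNSHARE_RANGE locking sequence above. */
		return fallocate(srcfd, FALLOC_FL_UNSHARE_RANGE, off, len);
	}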

> > 
> >>              File IOs are blocked when the target file is locked and are served during the
> >>              defragmentation idle time (file is unlocked).
> > 
> > What file IOs are being served in parallel? The defragmentation IO?
> > Something else?
> 
> Here, file IOs means the IO requests from user space applications, including the virtual
> machine engine.
> 
> > 
> >>              Though
> >>              the file IOs can't really go in parallel, they are not blocked for long. The
> >>              locking time basically depends on the segment size. Smaller segments usually take
> >>              less locking time and thus IOs are blocked for shorter periods; bigger segments
> >>              usually need more locking time and IOs are blocked for longer. Check the -s and -i
> >>              options to balance defragmentation against IO service.
> > 
> > How is a user supposed to know what the correct values are for their
> > storage, files, and workload? Algorithms should auto tune, not
> > require users and administrators to use trial and error to find the
> > best numbers to feed a given operation.
> 
> In my opinion, users need a way to control this according to their use case.
> Any algorithm will restrict what the user wants to do.
> Say, a user wants the defrag done as quickly as possible regardless of the resources it takes
> (CPU, IO and so on) when the production system is in a maintenance window. But when the production
> system is busy, the user wants the defrag to use fewer resources.

That's not for the defrag program to implement. That's what we use
resource control groups for. Things like memcgs, block IO cgroups,
scheduler cgroups, etc. Administrators are used to restricting the
resources used by applications with generic admin tools; asking them
to learn how some random admin tool does its own resource
utilisation restriction that requires careful hand tuning for -one
off admin events- is not the right way to solve this problem.

We should be making the admin tool go as fast as possible and
consume as much resources as are available. This makes it fast out
of the box, and lets the admins restrict the IO rate, CPU and memory
usage to bring it down to an acceptable resource usage level for
admin tasks on their systems.
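
e.g. to cap the tool's block IO bandwidth from the outside with
cgroup v2 (a sketch; the "defrag" cgroup name and the 8:16 device
numbers are made up for illustration):

	#include <stdio.h>

	/* Limit everything in the "defrag" cgroup to 50MB/s reads and
	 * writes on the block device at major:minor 8:16, then run the
	 * tool inside that cgroup. No defrag-specific throttling code
	 * is needed. */
	int main(void)
	{
		FILE *f = fopen("/sys/fs/cgroup/defrag/io.max", "w");

		if (!f)
			return 1;
		fprintf(f, "8:16 rbps=52428800 wbps=52428800\n");
		fclose(f);
		return 0;
	}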

> Another example: the kernel (algorithms) never knows the maximum IO latency that user applications can tolerate.
> But if you have some algorithms, please share.

As I said - make it as fast and low latency as reasonably possible.
If you have less than 10ms IO latency SLAs, the application isn't
going to be running on sparse, software defined storage that may
require hundreds of milliseconds of IO pauses during admin tasks.
Hence design to a max fixed IO latency (say 100ms) and make the
functionality run as fast as possible within that latency window.

If people need lower latency SLAs, then they shouldn't be running
that application on sparse, COW based VM images. This is not a
problem a defrag utility should be trying to solve.

> >>              Free blocks consumption
> >> 
> >>              Defragmentation works by (trying to) allocate new (contiguous) blocks, copying
> >>              data and then freeing the old (non-contig) blocks. Usually the number of old
> >>              blocks freed equals the number of newly allocated blocks. As a final result,
> >>              defragmentation doesn't consume free blocks.  Well, that is true if the target
> >>              file is not sharing blocks with other files.
> > 
> > This is really hard to read. Defragmentation will -always- consume
free space while it is in progress. It will always release the
> > temporary space it consumes when it completes.
> 
> I don't think it always frees blocks when it releases the temporary file. When the blocks were
> originally shared before defrag, the blocks won't be freed.

I didn't make myself clear. If the blocks shared to the temp file
are owned exclusively by the source file (i.e. they were COW'd from
shared extents at some time in the past), then that is space
that is temporarily required by the defragmentation process. UNSHARE
creates a second, permanent copy of those blocks in the source file
and closing the temp file then makes the original exclusively
owned blocks go away.

IOWs, defrag can temporarily consume an entire extra file's worth of
space between the UNSHARE starting and the freeing of the temporary
file when we are done with it. Freeing the temp file -always-
releases this extra space, though I note that the implementation is
to hole-punch it away after each segment has been processed.
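
That hole punch is a single fallocate call on the temp file per
segment - a sketch, with the offset/length variables assumed:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <linux/falloc.h>

	/* Release the temp file's blocks for one processed segment.
	 * FALLOC_FL_KEEP_SIZE leaves the temp file's length alone so
	 * later segments can still be cloned at their natural
	 * offsets. */
	static int punch_segment(int tmpfd, off_t seg_off, off_t seg_len)
	{
		return fallocate(tmpfd,
				 FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
				 seg_off, seg_len);
	}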

> > 
> >>              In case the target file contains shared blocks, those shared blocks won't
> >>              be freed back to the filesystem as they are still owned by other files. So
> >>              defragmentation allocates more blocks than it frees.
> > 
> > So this is doing an unshare operation as well as defrag? That seems
> > ... suboptimal. The whole point of sharing blocks is to minimise
> > disk usage for duplicated data.
> 
> That depends on the user's need. If users think defrag is the first
> priority, it is.  If users don't think the disk
> saving is the most important, it is not. No matter what developers think.
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

That's pretty ... dismissive.

I mean, you're flat out wrong. You make the assumption that a user
knows exactly how every file of every application on their system
has been created and knows exactly how best to defragment it.

That's just .... wrong.

Users and admins do not have intimate knowledge of how their
applications do their stuff, and a lot of them don't even know
that their systems are using file clones (i.e. reflink copies)
instead of data copies extensively these days.

That is completely the wrong way to approach administration
tools. 

Our defragmentation policy for xfs_fsr is to leave the structure of
the file as intact as possible. That means we replicate unwritten
regions in the defragmented file. We actually -defragment unwritten
extents- in xfs_fsr, not just written extents, and we do that
because we have to assume that the unwritten extents exist for a
good reason.

We don't expect the admin to make a decision as to whether unwritten
extents should be replicated or defragged - we make the assumption
that either the application or the admin has asked for them to exist
in the first place.

It is similar for defragmenting files that are largely made up of shared
extents. That layout exists for a reason, and it's not the place of
the defragmentation operation to unilaterally decide layout policies
for the admin and/or application that is using files with shared
extents.

Hence the defrag operation must preserve the *observed intent* of
the source file layout as much as possible and not require the admin
or user to be sufficiently informed to make the right decision one
way or another. We must attempt to preserve the status quo.

Hence if the file is largely shared, we must not unshare the entire
file to defragment it unless that is the only way to reduce the
fragmentation (e.g. resolve small interleaved shared and unshared
extents). If there are reasonable sized shared extents, we should be
leaving them alone and not unsharing them just to reduce the extent
count by a handful of extents.

> What's more, reflink (or sharing blocks) is not only used to minimize disk usage. Sometimes it's
> used as a way to take snapshots. And those snapshots might not stay around for long.

Yes, I know this. It doesn't change anything to do with how we
defragment a file that contains shared blocks.

If you don't want the snapshot(s) to affect defragmentation, then
don't run defrag while the snapshots are present. Otherwise, we
want defrag to retain as much sharing between the snapshots and
the source file because *minimising the space used by snapshots* is
the whole point of using file clones for snapshots in the first
place!

> And what's more, the unshare operation is what you suggested :D

I suggested it as a mechanism to defrag regions of shared files with
excessive fragmentation. I was not suggesting that "defrag ==
unshare".

> >>              For existing XFS, free blocks might be over-
> >>              committed when reflink snapshots were created. To avoid causing the XFS to run
> >>              into a low free blocks state, this defragmentation excludes (partially) shared
> >>              segments when the file system's free blocks reach a threshold. Check the -f option.
> > 
> > Again, how is the user supposed to know when they need to do this?
> > If the answer is "they should always avoid defrag on low free
> > space", then why is this an option?
> 
> I didn't say "they should always avoid defrag on low free space". And we can't even say how low
> is intolerable for a user; that depends on the use case. Though it's an option, it has a default
> value of 1GB. If users don't set this option, that is "always avoid defrag on low free space".

You didn't answer my question: how is the user supposed to know
when they should set this?

And, again, the followup question is: why does this need to be
built into the defrag tool?

