Re: XFS reflink overhead, ioctl(FICLONE)

On Sun, Dec 18, 2022 at 06:40:54PM -0500, Terence Kelly wrote:
> 
> 
> On Sun, 18 Dec 2022, Dave Chinner wrote:
> 
> > > https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf
> > 
> > Ah, now I get it. You want *anonymous ephemeral clones*, not named
> > persistent clones.  For everyone else, so they don't have to read the
> > paper and try to work it out:
> > 
> > The mechanism is a hacked-up O_ATOMIC path ...
> 
> No.  To be clear, nobody now in 2022 is asking for the AdvFS features of the
> FAST 2015 paper to be implemented in XFS (or BtrFS or any other FS).
> 
> The current XFS/BtrFS/Linux ioctl(FICLONE) is perfect for my current and
> foreseeable needs, except for performance.
> 
> I cited the FAST 2015 paper simply to show that I've worked with a
> clone-based mechanism in the past and it delighted me in every way.  It's
> simply an existence proof that cloning can be delightful for crash
> tolerance.

Sure, you're preaching to the choir. But the context was quoting a
paper as an example of the cloning performance you expected from XFS
but weren't getting. You're still talking about how XFS clones are
too slow for your needs, but now you are saying you don't want
clones for fault tolerance as implemented in AdvFS.

> > Hence the difference in functionality is that FICLONE provides
> > persistent, unrestricted named clones rather than ephemeral clones.
> 
> For the record, the AdvFS implementation of clone-based crash tolerance ---
> the moral equivalent of failure-atomic msync(), which was the topic of my
> EuroSys 2013 paper --- involved persistent files on durable storage; the
> files were hidden and were discarded when their usefulness was over but the
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is the very definition of an ephemeral filesystem object.

The clones are temporary filesystem objects that exist only within
the context of an active file descriptor, users don't know they
exist, users cannot discover their existence, and they get cleaned
up automatically by the filesystem when they are no longer useful.

Yes, there is some persistent state needed to implement the required
garbage collection semantics of the ephemeral object (just like
O_TMPFILE!), but that doesn't change the fact that users don't know
(or care) that the internal filesystem objects even exist.

Really, I can't think of a better example of an ephemeral object
than this, regardless of whether the paper's authors used that term
or not.

> hidden files were not "ephemeral" in the sense of a file in a DRAM-backed
> file system (/tmp/ or /dev/shm/ or whatnot).  AdvFS crash tolerance survived
> real power failures.  But this is a side issue of historical interest only.
>
> I mainly want to emphasize that nobody is asking for the behavior of AdvFS
> in that FAST 2015 paper.

OK, so what are you asking us to do, then?

[....]

> > > https://dl.acm.org/doi/pdf/10.1145/3400899.3400902
> > 
> > Heh. You're still using hardware to do filesystem power fail testing? We
> > moved away from needing hardware to do power fail testing of filesystems
> > several years ago.
> > 
> > Using functionality like dm-logwrites, we can simulate the effect of
> > several hundred different power fail cases with write-by-write replay
> > and recovery in the space of a couple of minutes.
> 
> Cool.  I assume you're familiar with a paper on a similar technique that my
> HP Labs colleagues wrote circa 2013 or 2014:  "Torturing Databases for Fun
> and Profit."

Nope, but it's not a new or revolutionary technique so I'm not
surprised that other people have done similar things. There's been
plenty of research based on model checking over the past 2-3 decades
- the series of Iron Filesystem papers is a good example of this.
What we have in fstests is just a version of these concepts that
simplifies discovering and debugging previously undiscovered write
ordering issues...

> > Not only that, failures are fully replayable and so we can actually
> > debug every single individual failure without having to guess at the
> > runtime context that created the failure or the recovery context that
> > exposed the failure.
> > 
> > This infrastructure has provided us with a massive step forward for
improving crash resilience and recovery capability in ext4, btrfs and
XFS.  These tests are built into automated test suites (e.g. fstests)
> > that pretty much all linux fs engineers and distro QE teams run these
> > days.
> 
> If you think the world would benefit from reading about this technique and
> using it more widely, I might be able to help.  My column in _Queue_
> magazine reaches thousands of readers, sometimes tens of thousands.  It's
> about teaching better techniques to working programmers.

You're welcome to do so - the source code is all there, there's a
mailing list for fstests where you can ask questions about it, etc.
If you think it's valuable for people outside the core linux fs
developer community, then you don't need to ask our permission to
write an article on it....

> > > I'm surprised that in XFS, cloning alone *without* fsync() pushes
> > > data down to storage.  I would have expected that the implementation
> > > of cloning would always operate upon memory alone, and that an
> > > explicit fsync() would be required to force data down to durable
> > > media. Analogy:  write() doesn't modify storage; write() plus
> > > fsync() does. Is there a reason why copying via ioctl(FICLONE) isn't
> > > similar?
> > 
> > Because FICLONE provides a persistent named clone that is a fully
> > functioning file in its own right.  That means it has to be completely
> > independent of the source file by the time the FICLONE operation
> > completes.  This implies that there is a certain order to the operations
> > the clone performs - the data has to be on disk before the clone is
> > made persistent and recoverable so that both files are guaranteed to have
> > identical contents if we crash immediately after the clone completes.
> 
> I thought the rule was that if an application doesn't call fsync() or
> msync(), no durability of any kind is guaranteed.

No durability of any kind is guaranteed, but that doesn't preclude
the OS and/or filesystem actually performing an operation in a way
that guarantees persistence....

That said, the FICLONE API doesn't guarantee persistence. The
application still has to call fdatasync() to ensure that all the
metadata changes that FICLONE makes are persisted all the way down
to stable storage.

> I thought modern file
> systems did all their work in DRAM until an explicit fsync/msync or other
> necessity compelled them to push data down to durable media (in the right
> order etc.).

Largely, they do. But some operations have dependencies and require
data/metadata update synchronisation, and at that point we have
ordering constraints. To an outside observer, that may look like
the filesystem is trying to provide durability, but in fact it is
doing nothing of the sort...

I suspect you've seen the data writeback in FICLONE and thought it
was there because it needs to provide a durability guarantee.

For XFS, this is an ordering constraint - we have to ensure the right
thing happens with delayed allocation and resolve pending COW
operations on a file before we clone the extent map to a new file.
We do this by running writeback to process these pending extent map
operations we deferred at write() time. Once those deferred
operations have been resolved, we can run the transactions to clone
the extent map.

However, if FICLONE is acting on files containing only data at rest,
then it can run without doing a single data IO, and the whole clone
can be lost on crash if fdatasync() is not run once it is complete.

IOWs, the FICLONE API provides no persistence guarantees.
fdatasync/O_DSYNC is still required.
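
A minimal sketch of that usage pattern (the function name, file paths
and error handling here are illustrative assumptions, not taken from
any real application):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FICLONE */

/* Clone src to dst, then make the result crash-safe. */
int clone_and_persist(const char *src, const char *dst)
{
	int ret = -1;
	int sfd = open(src, O_RDONLY);
	int dfd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (sfd < 0 || dfd < 0)
		goto out;

	/* Share the source file's extents with the destination. */
	if (ioctl(dfd, FICLONE, sfd) < 0)
		goto out;

	/*
	 * FICLONE itself guarantees nothing about persistence; the
	 * metadata changes it made have to be forced to stable storage
	 * before the clone can be relied on to survive a crash.
	 */
	ret = fdatasync(dfd);
out:
	if (sfd >= 0)
		close(sfd);
	if (dfd >= 0)
		close(dfd);
	return ret;
}

Drop the fdatasync() and the clone may still be there after a crash,
but nothing guarantees it.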

> Also, we might be using terminology differently:
> 
> I use "persistent" in the sense of "outlives processes".  Files in /tmp/ and
> /dev/shm/ are persistent, but not durable.

Yeah, different terminology - you seem to have different frames of
reference for the terms you are using.

The frame of reference I'm using for terminology is filesystem
objects rather than processes or storage.  Stuff that exists purely
in memory (such as tmpfs or shm files) is always considered
"volatile" - they are lost if the system crashes or shuts down.
Volatile storage also includes caches like dirty data in the page
cache and storage devices with DRAM based caches.

Persistent refers to ensuring filesystem objects are not volatile;
they do not get lost during shutdown or abnormal termination because
they have been guaranteed to exist on a stable, permanent storage
media. 

> I use "durable" to mean "written to non-volatile media (HDD or SSD) in such
> a way as to guarantee that it will survive power cycling."

Sure. We typically refer to non-volatile storage media as "stable
storage" because the hardware can be durable in the short term but
volatile in the long term. e.g. battery backed RAM is considered
"stable" if the battery backup lasts longer than 72 hours, but
over long periods it will not retain its contents. Hence calling it
"non-volatile media" isn't really correct - the contents are only
stable over a fixed timeframe.

Regardless of terminology, "persisting objects to stable
storage" is effectively the same thing as "making durable".

> I expect *persistence* from ioctl(FICLONE) but I didn't expect a
> *durability* guarantee without fsync().  If I'm understanding you correctly,
> cloning in XFS gives us durability whether we want it or not.

See above. We provide no guarantees about persistence, but in some
cases we can't perform the FICLONE operation correctly without
performing most of the operations needed to provide persistence of
the source file.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx


