On Sun, 18 Dec 2022, Dave Chinner wrote:
>> https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf
> Ah, now I get it. You want *anonymous ephemeral clones*, not named
> persistent clones. For everyone else, so they don't have to read the
> paper and try to work it out:
> The mechanism is a hack of the O_ATOMIC path ...
No. To be clear, nobody now in 2022 is asking for the AdvFS features of
the FAST 2015 paper to be implemented in XFS (or Btrfs or any other FS).
The current XFS/Btrfs/Linux ioctl(FICLONE) is perfect for my current and
foreseeable needs, except for performance.
I cited the FAST 2015 paper simply to show that I've worked with a
clone-based mechanism in the past and it delighted me in every way. It's
simply an existence proof that cloning can be delightful for crash
tolerance.
> Hence the difference in functionality is that FICLONE provides
> persistent, unrestricted named clones rather than ephemeral clones.
For the record, the AdvFS implementation of clone-based crash tolerance
--- the moral equivalent of failure-atomic msync(), which was the topic of
my EuroSys 2013 paper --- involved persistent files on durable storage;
the files were hidden and were discarded when their usefulness was over,
but the hidden files were not "ephemeral" in the sense of a file in a
DRAM-backed file system (/tmp/ or /dev/shm/ or whatnot). AdvFS crash
tolerance survived real power failures. But this is a side issue of
historical interest only.
I mainly want to emphasize that nobody is asking for the behavior of AdvFS
in that FAST 2015 paper.
> We could implement ephemeral clones in XFS, but nobody has ever
> mentioned needing or wanting such functionality until this thread.
Nobody needs or wants such functionality, even in this thread. The
current ioctl(FICLONE) is perfect except for performance.
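For anyone following along who hasn't used it, the interface is a single
ioctl. Here is a minimal sketch of the operation Suyash and I are
benchmarking; the file names are placeholders and error handling is minimal.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>               /* FICLONE */

    int main(void)
    {
        int src = open("data.db", O_RDONLY);
        int dst = open("data.db.clone", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (src < 0 || dst < 0) { perror("open"); return 1; }

        /* Share src's extents with dst instead of copying its bytes. */
        if (ioctl(dst, FICLONE, src) < 0) { perror("ioctl(FICLONE)"); return 1; }

        close(dst);
        close(src);
        return 0;
    }

This of course requires a filesystem with reflink support (e.g., XFS with
reflink enabled, or Btrfs), which is exactly the setup we are measuring.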
>> https://dl.acm.org/doi/pdf/10.1145/3400899.3400902
> Heh. You're still using hardware to do filesystem power fail testing?
> We moved away from needing hardware to do power fail testing of
> filesystems several years ago.
> Using functionality like dm-logwrites, we can simulate the effect of
> several hundred different power fail cases with write-by-write replay
> and recovery in the space of a couple of minutes.
Cool. I assume you're familiar with a paper on a similar technique that
my HP Labs colleagues wrote circa 2013 or 2014: "Torturing Databases for
Fun and Profit."
> Not only that, failures are fully replayable and so we can actually
> debug every single individual failure without having to guess at the
> runtime context that created the failure or the recovery context that
> exposed the failure.
> This infrastructure has provided us with a massive step forward for
> improving crash resilience and recovery capability in ext4, btrfs and
> XFS. These tests are built into automated test suites (e.g. fstests)
> that pretty much all Linux fs engineers and distro QE teams run these
> days.
If you think the world would benefit from reading about this technique and
using it more widely, I might be able to help. My column in _Queue_
magazine reaches thousands of readers, sometimes tens of thousands. It's
about teaching better techniques to working programmers.
I'd be honored to help pass along to my readers practical techniques that
you're using to improve quality.
> IOWs, hardware based power fail testing of filesystems is largely
> obsolete these days....
I don't mind telling the world that my own past work is obsolete. That's
what progress is all about.
>> I'm surprised that in XFS, cloning alone *without* fsync() pushes data
>> down to storage. I would have expected that the implementation of
>> cloning would always operate upon memory alone, and that an explicit
>> fsync() would be required to force data down to durable media.
>> Analogy: write() doesn't modify storage; write() plus fsync() does.
>> Is there a reason why copying via ioctl(FICLONE) isn't similar?
> Because FICLONE provides a persistent named clone that is a fully
> functioning file in its own right. That means it has to be completely
> independent of the source file by the time the FICLONE operation
> completes. This implies that there is a certain order to the operations
> the clone performs - the data has to be on disk before the clone is
> made persistent and recoverable so that both files are guaranteed to have
> identical contents if we crash immediately after the clone completes.
I thought the rule was that if an application doesn't call fsync() or
msync(), no durability of any kind is guaranteed. I thought modern file
systems did all their work in DRAM until an explicit fsync/msync or other
necessity compelled them to push data down to durable media (in the right
order etc.).
Also, we might be using terminology differently:
I use "persistent" in the sense of "outlives processes". Files in /tmp/
and /dev/shm/ are persistent, but not durable.
I use "durable" to mean "written to non-volatile media (HDD or SSD) in
such a way as to guarantee that it will survive power cycling."
I expect *persistence* from ioctl(FICLONE) but I didn't expect a
*durability* guarantee without fsync(). If I'm understanding you
correctly, cloning in XFS gives us durability whether we want it or not.
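In code, the discipline I had assumed ioctl(FICLONE) would require looks
like the sketch below; the paths and function name are placeholders and
error checks are omitted for brevity.

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>

    /* Clone, then explicitly harden: the two-step pattern I expected,
       by analogy with write() followed by fsync(). */
    void clone_then_harden(const char *src_path, const char *dst_path)
    {
        int src = open(src_path, O_RDONLY);
        int dst = open(dst_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        ioctl(dst, FICLONE, src); /* expected: in-memory extent sharing only */
        fsync(dst);               /* expected: clone data/metadata durable   */
        /* ...plus an fsync() of the parent directory to make the new name
           itself durable, as with any freshly created file. */

        close(dst);
        close(src);
    }

If I'm understanding you correctly, XFS already pushes the data down as part
of the clone itself, which is why we see storage traffic without any fsync().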
>> Finally I understand your explanation that the cost of cloning is
>> proportional to the size of the extent map, and that in the limit where
>> the extent map is very large, cloning a file of size N requires O(N)
>> time. However the constant factors surprise me. If memory serves we
>> were seeing latencies of milliseconds atop DRAM for the first few
>> clones on files that began as sparse files and had only a few blocks
>> written to them. Copying the extent map on a DRAM file system must be
>> tantamount to a bunch of memcpy() calls (right?),
> At the IO layer, yes, it's just a memcpy.
> But we can't just copy a million extents from one in-memory btree to
> another. We have to modify the filesystem metadata in an atomic,
> transactional, recoverable way. Those transactions work one extent at a
> time because each extent might require a different set of modifications.
Ah, so now I see where the time goes. This is clear.
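If it helps other readers, here is how I would caricature the distinction in
code. This is emphatically not XFS code; the types and helper names below
are invented purely for illustration.

    #include <stddef.h>
    #include <string.h>

    struct extent { unsigned long long start, len; };

    /* Hypothetical stand-ins for the real logged-metadata machinery. */
    static void begin_transaction(void)                       { }
    static void map_extent_into_clone(const struct extent *e) { (void)e; }
    static void bump_shared_refcount(const struct extent *e)  { (void)e; }
    static void commit_transaction(void)                      { }

    /* What I had imagined: duplicating an in-memory extent list. */
    static void clone_in_memory(struct extent *dst,
                                const struct extent *src, size_t n)
    {
        memcpy(dst, src, n * sizeof *src);   /* O(n), tiny constant factor */
    }

    /* What a persistent named clone requires: one recoverable transaction
       per shared extent, each updating on-disk mappings and refcounts. */
    static void clone_persistent(const struct extent *src, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            begin_transaction();
            map_extent_into_clone(&src[i]);
            bump_shared_refcount(&src[i]);
            commit_transaction();            /* O(n), large constant factor */
        }
    }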
> Persistent clones require tracking of the number of times a given block
> on disk is shared so that we know when extent removals result in the
> extent no longer being shared and/or referenced. A file that has been
> cloned a million times might have a million extents each shared a
> different number of times. When we remove one of those clones, how do we
> know which blocks are now unreferenced and need to be freed?
> IOWs, named persistent clones are *much more complex* than ephemeral
> clones.
Again, I don't know where you're getting "ephemeral" from; that word does
not appear in the FAST '15 paper. The AdvFS clones of the FAST '15 paper
were both durable and persistent; they were just hidden from the
user-visible namespace. A crash (power outage or whatever) caused a file
to revert to the most recent hidden clone. In AdvFS, a hidden clone was
created by an fsync/msync call. This is how AdvFS made file updates
failure-atomic.
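To make the contrast concrete, here is roughly what the application saw,
reconstructed from memory and simplified; the AdvFS-specific opt-in open
flag is omitted, and the path and length are placeholders.

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    void failure_atomic_update(const char *path, size_t len)
    {
        int fd = open(path, O_RDWR);   /* plus the AdvFS opt-in flag */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        p[0]       ^= 1;               /* arbitrary in-place modifications,  */
        p[len - 1] ^= 1;               /* possibly scattered across the file */

        /* Commit point: AdvFS captured a hidden, durable clone here.  After
           a crash the file reverts to its state as of the most recent
           successful msync(); never a torn mixture of old and new. */
        msync(p, len, MS_SYNC);

        munmap(p, len);
        close(fd);
    }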
Again, we're not asking for the same functionality as in the FAST '15 paper.
However, if the contrast between what AdvFS did with clones and how XFS
works illuminates issues like XFS performance, then it might be worth
understanding AdvFS.
Incidentally, I really appreciate the time & effort you're taking to
educate me & Suyash. I hope I'm not being too sluggish a student, though
sometimes I am.
For the near term, Suyash and I are getting closer to an understanding of
today's ioctl(FICLONE) that we can pass along to readers in the paper
we're writing.
> The overhead you are measuring is the result of all the persistent cross
> referencing and reference counting metadata we need to atomically update
> on each extent sharing operation to ensure long term persistent clones work
> correctly.
This is clear. Thanks.
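In the paper, Suyash and I may illustrate that bookkeeping with a toy model
along the following lines; the names are invented and nothing here is
filesystem-specific.

    #include <stdbool.h>

    struct shared_extent {
        unsigned long long start, len;
        unsigned long long refcount;   /* how many files currently map this extent */
    };

    /* Called for each extent of a clone being deleted.  Only when the last
       reference disappears may the underlying blocks be freed, which is why
       every unshare must consult (and durably update) a per-extent count. */
    static bool drop_reference(struct shared_extent *e)
    {
        e->refcount--;                 /* in a real FS this is a logged, on-disk update */
        return e->refcount == 0;       /* true: the blocks may now be reused */
    }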
> If we were to implement ephemeral clones as per the mechanism you've
> outlined in the papers above, then we could just copy the in-memory
> extent list btree with a series of memcpy() operations because we don't
> need persistent on-disk shared reference counting to implement it....
We're not on the same page about what AdvFS did.
Of course I'll understand if you don't have time or interest to get on the
same page; we understand that you're busy with a lot of important work.
Thanks for your help and Happy Holidays!
> Cheers,
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx