Re: XFS reflink overhead, ioctl(FICLONE)

On Sun, 18 Dec 2022, Dave Chinner wrote:

> > https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf

> Ah, now I get it. You want *anonymous ephemeral clones*, not named persistent clones. For everyone else, so they don't have to read the paper and try to work it out:

> The mechanism is a hacked O_ATOMIC path ...

No. To be clear, nobody now in 2022 is asking for the AdvFS features of the FAST 2015 paper to be implemented in XFS (or BtrFS or any other FS).

The current XFS/BtrFS/Linux ioctl(FICLONE) is perfect for my current and foreseeable needs, except for performance.

I cited the FAST 2015 paper simply to show that I've worked with a clone-based mechanism in the past and it delighted me in every way. It's simply an existence proof that cloning can be delightful for crash tolerance.

> Hence the difference in functionality is that FICLONE provides persistent, unrestricted named clones rather than ephemeral clones.

For the record, the AdvFS implementation of clone-based crash tolerance --- the moral equivalent of failure-atomic msync(), which was the topic of my EuroSys 2013 paper --- involved persistent files on durable storage. The files were hidden and were discarded when their usefulness ended, but they were not "ephemeral" in the sense of a file in a DRAM-backed file system (/tmp/ or /dev/shm/ or whatnot). AdvFS crash tolerance survived real power failures. But this is a side issue of historical interest only.
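For readers who don't want to dig up those papers, the calling convention looked roughly like the sketch below. I'm reconstructing it from memory; the O_ATOMIC flag was the research prototype's interface, not something available in mainline Linux, and error handling is abbreviated.

/* Sketch from memory of the failure-atomic msync() calling convention
 * described in the EuroSys 2013 and FAST 2015 papers.  O_ATOMIC is the
 * research prototype's flag, not a mainline Linux flag; the #define
 * below is only a placeholder so the sketch compiles. */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef O_ATOMIC
#define O_ATOMIC 0
#endif

int update_failure_atomically(const char *path, size_t len)
{
        int fd = open(path, O_RDWR | O_ATOMIC);
        if (fd < 0)
                return -1;

        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
                close(fd);
                return -1;
        }

        p[0] = 'x';     /* arbitrary in-place modifications ... */

        /* On AdvFS this msync() committed *all* modifications since the
         * previous commit as one failure-atomic, durable unit, using a
         * hidden clone; after a crash the file reverts to the last commit. */
        int ret = msync(p, len, MS_SYNC);

        munmap(p, len);
        close(fd);
        return ret;
}

The point is that msync() was the commit operation; the hidden clone behind it was entirely the filesystem's business.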

I mainly want to emphasize that nobody is asking for the behavior of AdvFS in that FAST 2015 paper.

> We could implement ephemeral clones in XFS, but nobody has ever mentioned needing or wanting such functionality until this thread.

Nobody needs or wants such functionality, even in this thread. The current ioctl(FICLONE) is perfect except for performance.

> > https://dl.acm.org/doi/pdf/10.1145/3400899.3400902

> Heh. You're still using hardware to do filesystem power fail testing? We moved away from needing hardware to do power fail testing of filesystems several years ago.

> Using functionality like dm-logwrites, we can simulate the effect of several hundred different power fail cases with write-by-write replay and recovery in the space of a couple of minutes.

Cool. I assume you're familiar with a paper on a similar technique that my HP Labs colleagues wrote circa 2013 or 2014: "Torturing Databases for Fun and Profit."

> Not only that, failures are fully replayable and so we can actually debug every single individual failure without having to guess at the runtime context that created the failure or the recovery context that exposed the failure.

> This infrastructure has provided us with a massive step forward for improving crash resilience and recovery capability in ext4, btrfs and XFS. These tests are built into automated test suites (e.g. fstests) that pretty much all Linux fs engineers and distro QE teams run these days.

If you think the world would benefit from reading about this technique and using it more widely, I might be able to help. My column in _Queue_ magazine reaches thousands of readers, sometimes tens of thousands. It's about teaching better techniques to working programmers.

I'd be honored to help pass along to my readers practical techniques that you're using to improve quality.

> IOWs, hardware based power fail testing of filesystems is largely obsolete these days....

I don't mind telling the world that my own past work is obsolete. That's what progress is all about.

> > I'm surprised that in XFS, cloning alone *without* fsync() pushes data down to storage. I would have expected that the implementation of cloning would always operate upon memory alone, and that an explicit fsync() would be required to force data down to durable media. Analogy: write() doesn't modify storage; write() plus fsync() does. Is there a reason why copying via ioctl(FICLONE) isn't similar?

> Because FICLONE provides a persistent named clone that is a fully functioning file in its own right. That means it has to be completely independent of the source file by the time the FICLONE operation completes. This implies that there is a certain order to the operations the clone performs - the data has to be on disk before the clone is made persistent and recoverable, so that both files are guaranteed to have identical contents if we crash immediately after the clone completes.

I thought the rule was that if an application doesn't call fsync() or msync(), no durability of any kind is guaranteed. I thought modern file systems did all their work in DRAM until an explicit fsync/msync or other necessity compelled them to push data down to durable media (in the right order etc.).

Also, we might be using terminology differently:

I use "persistent" in the sense of "outlives processes". Files in /tmp/ and /dev/shm/ are persistent, but not durable.

I use "durable" to mean "written to non-volatile media (HDD or SSD) in such a way as to guarantee that it will survive power cycling."

I expect *persistence* from ioctl(FICLONE) but I didn't expect a *durability* guarantee without fsync(). If I'm understanding you correctly, cloning in XFS gives us durability whether we want it or not.
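For the paper, the conservative pattern we currently plan to show readers looks like the sketch below: clone with ioctl(FICLONE), then fsync() the clone only when durability is explicitly wanted. Error handling is abbreviated, and whether the fsync() is redundant on XFS is exactly the question above.

/* Conservative pattern: FICLONE for the (persistent) clone, fsync()
 * only if durability across power failure is required.  Whether XFS
 * makes the fsync() redundant is the question under discussion. */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>           /* FICLONE */

int clone_file(const char *src_path, const char *dst_path, int want_durable)
{
        int ret = -1;
        int src = open(src_path, O_RDONLY);
        int dst = open(dst_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (src < 0 || dst < 0)
                goto out;

        if (ioctl(dst, FICLONE, src) < 0)       /* dst now shares src's extents */
                goto out;

        ret = want_durable ? fsync(dst) : 0;
out:
        if (src >= 0)
                close(src);
        if (dst >= 0)
                close(dst);
        return ret;
}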

> > Finally I understand your explanation that the cost of cloning is proportional to the size of the extent map, and that in the limit where the extent map is very large, cloning a file of size N requires O(N) time. However the constant factors surprise me. If memory serves we were seeing latencies of milliseconds atop DRAM for the first few clones on files that began as sparse files and had only a few blocks written to them. Copying the extent map on a DRAM file system must be tantamount to a bunch of memcpy() calls (right?),

> At the IO layer, yes, it's just a memcpy.

> But we can't just copy a million extents from one in-memory btree to another. We have to modify the filesystem metadata in an atomic, transactional, recoverable way. Those transactions work one extent at a time because each extent might require a different set of modifications.

Ah, so now I see where the time goes.  This is clear.

> Persistent clones require tracking of the number of times a given block on disk is shared so that we know when extent removals result in the extent no longer being shared and/or referenced. A file that has been cloned a million times might have a million extents each shared a different number of times. When we remove one of those clones, how do we know which blocks are now unreferenced and need to be freed?

> IOWs, named persistent clones are *much more complex* than ephemeral clones.

Again, I don't know where you're getting "ephemeral" from; that word does not appear in the FAST '15 paper. The AdvFS clones of the FAST '15 paper were both durable and persistent; they were just hidden from the user-visible namespace. A crash (power outage or whatever) caused a file to revert to the most recent hidden clone. In AdvFS, a hidden clone was created by an fsync/msync call. This is how AdvFS made file updates failure-atomic.

Again, we're not asking for the same functionality as the FAST '15 paper.

However, if the contrast between what AdvFS did with clones and how XFS works illuminates issues like XFS performance, then it might be worth understanding AdvFS.

Incidentally, I really appreciate the time & effort you're taking to educate me & Suyash. I hope I'm not being too sluggish a student, though sometimes I am.

For the near term, Suyash and I are getting closer to an understanding of today's ioctl(FICLONE) that we can pass along to readers in the paper we're writing.

> The overhead you are measuring is the result of all the persistent cross-referencing and reference-counting metadata we need to atomically update on each extent sharing operation to ensure long-term persistent clones work correctly.

This is clear.  Thanks.
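To make sure Suyash and I relay this correctly, our write-up will illustrate the bookkeeping with a deliberately naive toy model along the following lines. It is my own illustration of per-extent reference counting, not XFS code; free_blocks() is a hypothetical helper.

/* Toy model of per-extent reference counting; my own illustration for
 * the write-up, not XFS code.  Each share and each release of a shared
 * extent is bookkeeping that the filesystem must record durably and
 * transactionally; that is where the per-extent cost comes from. */
struct extent {
        unsigned long start_block;
        unsigned long nr_blocks;
        unsigned long refcount;         /* files currently mapping this extent */
};

/* Cloning a file shares every one of its extents ... */
static void share_extent(struct extent *ex)
{
        ex->refcount++;
}

/* ... and removing one clone must walk its extents, decrement, and free
 * the underlying blocks only when nobody else references them. */
static void release_extent(struct extent *ex)
{
        if (--ex->refcount == 0) {
                /* free_blocks(ex->start_block, ex->nr_blocks);  hypothetical */
        }
}

Multiply that bookkeeping by a million extents, recorded durably and transactionally, and the cost you describe follows.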

> If we were to implement ephemeral clones as per the mechanism you've outlined in the papers above, then we could just copy the in-memory extent list btree with a series of memcpy() operations because we don't need persistent on-disk shared reference counting to implement it....

We're not on the same page about what AdvFS did.

Of course I'll understand if you don't have time or interest to get on the same page; we understand that you're busy with a lot of important work.

Thanks for your help and Happy Holidays!

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx



