Re: XFS reflink overhead, ioctl(FICLONE)

Hi Dave,

Thanks for your quick and detailed reply.  More inline....

On Thu, 15 Dec 2022, Dave Chinner wrote:

>> Regardless of the block device (the plot includes results for optane and RamFS), it seems like the ioctl(FICLONE) call is slow.

> Please define "slow" - is it actually slower than it should be (i.e. a bug) or does it simply not perform according to your expectations?

I was surprised that on a DRAM-backed file system the ioctl(FICLONE) took *milli*seconds right from the start, and grew to *tens* of milliseconds. There's no slow block storage device to increase latency; all of the latency is due to software. I was expecting microseconds of latency with DRAM underneath.
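
For concreteness, a measurement of this kind amounts to little more than bracketing a single ioctl(FICLONE) call with clock_gettime(). A minimal sketch (not our exact harness; the function name is made up, and src_fd/dst_fd stand for descriptors already open on the same reflink-capable filesystem):

#include <linux/fs.h>      /* FICLONE */
#include <sys/ioctl.h>
#include <stdio.h>
#include <time.h>

/* Time one reflink clone of src_fd's contents into dst_fd; returns
 * the latency in milliseconds, or -1.0 on failure. */
static double clone_latency_ms(int src_fd, int dst_fd)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (ioctl(dst_fd, FICLONE, src_fd) != 0) {
        perror("ioctl(FICLONE)");
        return -1.0;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e3 +
           (t1.tv_nsec - t0.tv_nsec) / 1e6;
}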

Performance matters because cloning is an excellent crash-tolerance mechanism. Applications that maintain persistent state in files --- that's a huge number of applications --- can make clones of said files and recover from crashes by reverting to the most recent successful clone. In many situations this is much easier and better than shoe-horning application data into something like an ACID-transactional relational database or transactional key-value store. But the run-time cost of making a clone during failure-free operation can't be excessive. Cloning for crash tolerance usually requires durable media beneath the file system (HDD or SSD, not DRAM), so performance on block storage devices matters too. We measured performance of cloning atop DRAM to understand how much latency is due to block storage hardware vs. software alone.
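
To make the usage pattern concrete: the failure-free path amounts to periodically reflink-cloning the live state file to a checkpoint name and fsync()ing the result, and recovery reverts to the newest successful checkpoint. A minimal sketch, with illustrative filenames and abbreviated error handling:

#include <linux/fs.h>      /* FICLONE */
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>

/* Clone the live state file to a checkpoint file and make it durable.
 * Returns 0 on success, -1 on failure. */
int take_checkpoint(const char *state_path, const char *ckpt_path)
{
    int ret = -1;
    int src = open(state_path, O_RDONLY);
    int dst = open(ckpt_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (src >= 0 && dst >= 0 &&
        ioctl(dst, FICLONE, src) == 0 &&   /* share extents, no data copy */
        fsync(dst) == 0)                   /* make the checkpoint durable */
        ret = 0;

    if (dst >= 0) close(dst);
    if (src >= 0) close(src);
    return ret;
}

(A real implementation would also fsync() the checkpoint's parent directory so the new directory entry is durable, and would rotate checkpoint names so that a known-good checkpoint always exists.)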

My colleagues and I started working on clone-based crash tolerance mechanisms nearly a decade ago. Extensive experience with cloning and related mechanisms in the HP Advanced File System (AdvFS), a Linux port of the DEC Tru64 file system, taught me to expect cloning to be *faster* than alternatives for crash tolerance:

https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf

https://web.eecs.umich.edu/~tpkelly/papers/HPL-2015-103.pdf

The point I'm trying to make is: I'm a serious customer who loves cloning and my performance expectations aren't based on idle speculation but on experience with other cloning implementations. (AdvFS is not open source and I'm no longer an HP employee, so I no longer have access to it.)

More recently I torture-tested XFS cloning as a crash-tolerance mechanism by subjecting it to real whole-system power interruptions:

https://dl.acm.org/doi/pdf/10.1145/3400899.3400902

I performed these correctness tests before making any performance measurements because I don't care how fast a mechanism is if it doesn't correctly tolerate crashes. XFS passed the power-fail tests with flying colors. Now it's time to consider performance.

I'm surprised that in XFS, cloning alone, *without* fsync(), pushes data down to storage. I would have expected the implementation of cloning to operate on memory alone, with an explicit fsync() required to force data down to durable media. Analogy: write() doesn't modify storage; write() plus fsync() does. Is there a reason why cloning via ioctl(FICLONE) isn't similar?
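
To spell out the durability model I had in mind (purely an illustration; fd, src_fd, and dst_fd are assumed to be already-open descriptors):

#include <linux/fs.h>      /* FICLONE */
#include <sys/ioctl.h>
#include <unistd.h>

/* Illustration only: the split I expected between mutation and durability. */
void expected_model(int fd, const void *buf, size_t len, int src_fd, int dst_fd)
{
    write(fd, buf, len);            /* mutates the page cache only...     */
    fsync(fd);                      /* ...this is what reaches the media  */

    ioctl(dst_fd, FICLONE, src_fd); /* expected: in-memory extent sharing */
    fsync(dst_fd);                  /* expected: durability deferred here */
}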

Finally, I understand your explanation that the cost of cloning is proportional to the size of the extent map, and that in the limit where the extent map is very large, cloning a file of size N requires O(N) time. However, the constant factors surprise me. If memory serves, we were seeing latencies of milliseconds atop DRAM for the first few clones of files that began as sparse files and had only a few blocks written to them. Copying the extent map on a DRAM-backed file system must be tantamount to a bunch of memcpy() calls (right?), and I'm surprised that the volume of data to be memcpy'd is so large that it takes milliseconds.
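
If it would help to quantify this, counting a file's extents is easy, e.g. with FIEMAP and fm_extent_count set to zero (which makes the kernel report only the number of extents); xfs_bmap gives the same information from the shell. A minimal sketch (the function name is mine; fd is assumed open on the file of interest):

#include <linux/fiemap.h>  /* struct fiemap, FIEMAP_MAX_OFFSET */
#include <linux/fs.h>      /* FS_IOC_FIEMAP */
#include <sys/ioctl.h>
#include <string.h>

/* Return the number of extents backing the file open on fd, or -1. */
long count_extents(int fd)
{
    struct fiemap fm;

    memset(&fm, 0, sizeof(fm));
    fm.fm_start = 0;
    fm.fm_length = FIEMAP_MAX_OFFSET;  /* map the whole file        */
    fm.fm_extent_count = 0;            /* 0: report the count only  */
    if (ioctl(fd, FS_IOC_FIEMAP, &fm) != 0)
        return -1;
    return fm.fm_mapped_extents;
}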

We might be able to take some of the additional measurements you suggested during/after the holidays.

Thanks again.

> A few things that you can quantify to answer these questions.
>
> ...



