Hi Dave,
Thanks for your quick and detailed reply. More inline....
On Thu, 15 Dec 2022, Dave Chinner wrote:
> > Regardless of the block device (the plot includes results for optane
> > and RamFS), it seems like the ioctl(FICLONE) call is slow.
>
> Please define "slow" - is it actually slower than it should be (i.e. a
> bug) or does it simply not perform according to your expectations?
I was surprised that on a DRAM-backed file system the ioctl(FICLONE) took
*milli*seconds right from the start, and grew to *tens* of milliseconds.
There's no slow block storage device to increase latency; all of the
latency is due to software. I was expecting microseconds of latency with
DRAM underneath.
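For concreteness, here's roughly how we time a single clone; a
minimal sketch, not our actual benchmark harness (paths come from
the command line and error handling is abbreviated):

#include <fcntl.h>
#include <linux/fs.h>      /* FICLONE */
#include <stdio.h>
#include <sys/ioctl.h>
#include <time.h>
#include <unistd.h>

/* Clone src into dst and report the wall-clock latency of the
 * ioctl(FICLONE) call itself, in microseconds. */
int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
        return 1;
    }
    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0) {
        perror("open");
        return 1;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (ioctl(dst, FICLONE, src) < 0) {
        perror("ioctl(FICLONE)");
        return 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("FICLONE latency: %.1f us\n",
           (t1.tv_sec - t0.tv_sec) * 1e6 +
           (t1.tv_nsec - t0.tv_nsec) / 1e3);
    return 0;
}

Run something like the above repeatedly against a file on a
DRAM-backed mount and the printed latencies are in the millisecond
range from the start.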
Performance matters because cloning is an excellent crash-tolerance
mechanism. Applications that maintain persistent state in files (a
huge number of applications) can make clones of those files and
recover from crashes by reverting to the most recent successful
clone, as sketched in code below. In many situations this is much
easier and better than shoehorning application data into something
like an ACID-transactional relational database or a transactional
key-value store. But the run-time cost of
making a clone during failure-free operation can't be excessive. Cloning
for crash tolerance usually requires durable media beneath the file system
(HDD or SSD, not DRAM), so performance on block storage devices matters
too. We measured performance of cloning atop DRAM to understand how much
latency is due to block storage hardware vs. software alone.
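To make the mechanism concrete, here's a minimal sketch of the
checkpoint step (the checkpoint() helper and file names are
hypothetical; a production version would also version the
checkpoint names and fsync the containing directory):

#include <fcntl.h>
#include <linux/fs.h>      /* FICLONE */
#include <sys/ioctl.h>
#include <unistd.h>

/* Clone data_path to ckpt_path and make the clone durable.
 * Both files must reside on the same reflink-capable file
 * system (e.g., XFS made with reflink=1).
 * Returns 0 on success, -1 on failure. */
int checkpoint(const char *data_path, const char *ckpt_path)
{
    int ret = -1;
    int src = open(data_path, O_RDONLY);
    int dst = open(ckpt_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (src >= 0 && dst >= 0 &&
        ioctl(dst, FICLONE, src) == 0 &&  /* share extents; O(extent map) */
        fsync(dst) == 0)                  /* make the checkpoint durable  */
        ret = 0;

    if (src >= 0) close(src);
    if (dst >= 0) close(dst);
    return ret;
}

Recovery amounts to discarding the working file and cloning the
newest intact checkpoint back over it. The attraction is that the
failure-free cost is one clone per checkpoint rather than a full
data copy.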
My colleagues and I started working on clone-based crash tolerance
mechanisms nearly a decade ago. Extensive experience with cloning and
related mechanisms in the HP Advanced File System (AdvFS), a Linux port of
the DEC Tru64 file system, taught me to expect cloning to be *faster* than
alternatives for crash tolerance:
https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf
https://web.eecs.umich.edu/~tpkelly/papers/HPL-2015-103.pdf
The point I'm trying to make is this: I'm a serious customer who
loves cloning, and my performance expectations aren't based on idle
speculation but on experience with other cloning implementations.
experience with other cloning implementations. (AdvFS is not open source
and I'm no longer an HP employee, so I no longer have access to it.)
More recently I torture-tested XFS cloning as a crash-tolerance mechanism
by subjecting it to real whole-system power interruptions:
https://dl.acm.org/doi/pdf/10.1145/3400899.3400902
I performed these correctness tests before making any performance
measurements because I don't care how fast a mechanism is if it doesn't
correctly tolerate crashes. XFS passed the power-fail tests with flying
colors. Now it's time to consider performance.
I'm surprised that in XFS, cloning alone *without* fsync() pushes data
down to storage. I would have expected that the implementation of cloning
would always operate upon memory alone, and that an explicit fsync() would
be required to force data down to durable media. Analogy: write()
doesn't modify storage; write() plus fsync() does. Is there a reason why
copying via ioctl(FICLONE) isn't similar?
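In code, the contract I had in mind; a sketch of my expectation,
not a description of what XFS actually does (error handling
omitted):

#include <linux/fs.h>      /* FICLONE */
#include <sys/ioctl.h>
#include <unistd.h>

/* The analogy I expected to hold. */
void update_then_persist(int fd, const void *buf, size_t n)
{
    write(fd, buf, n);   /* mutates the page cache only...        */
    fsync(fd);           /* ...and only now touches durable media */
}

void clone_then_persist(int dst_fd, int src_fd)
{
    ioctl(dst_fd, FICLONE, src_fd);  /* expected: in-memory metadata only */
    fsync(dst_fd);                   /* expected: persist the shared map  */
}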
Finally, I understand your explanation that the cost of cloning is
proportional to the size of the extent map, and that in the limit where
the extent map is very large, cloning a file of size N requires O(N) time.
However, the constant factors surprise me. If memory serves, we
were seeing latencies of milliseconds atop DRAM for the first few
clones of files that began as sparse files and had only a few
blocks written to them. Copying
the extent map on a DRAM file system must be tantamount to a bunch of
memcpy() calls (right?), and I'm surprised that the volume of data that
must be memcpy'd is so large that it takes milliseconds.
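Back of the envelope, to show why the constant factors puzzle me
(assuming an extent record costs on the order of 16-32 bytes,
which I believe is the right ballpark for XFS's BMBT records;
please correct me if the in-memory representation is much larger):

    10 extents * 32 bytes/extent    ~=  320 bytes to copy
    320 bytes at ~10 GB/s memcpy()  ~=   32 nanoseconds

    observed clone latency          ~=  1-10 milliseconds

That's roughly five orders of magnitude of headroom, so presumably
the time goes to locking, transaction setup, and log traffic
rather than to copying extent records per se.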
We might be able to take some of the additional measurements you suggested
during/after the holidays.
Thanks again.
> A few things that you can quantify to answer these questions.
> ...