wish list for Santa (was: Re: XFS reflink overhead, ioctl(FICLONE))

Terence Kelly <tpkelly@xxxxxxxxxxxxxx> · Wed, 21 Dec 2022 18:07:48 -0500 (EST)

Hi Dave,

To answer your question below:

When we sent our observations about ioctl(FICLONE) performance recently, 
starting this e-mail thread, we were hoping for one of two outcomes:  If 
we were misusing the feature, guidance on how to obtain better 
performance.  Or, if we weren't doing anything wrong, an explanation of 
why ioctl(FICLONE) isn't as fast as we expected based on experience with 
the clone-based crash-tolerance mechanism in AdvFS.  In recent days we've 
been getting the latter, for which we are grateful.  We may try to pass 
along your explanations in a paper we're writing; if so we'll offer y'all 
the opportunity to review the paper and ask whether you'd like to be 
acknowledged.

In the longer term, we're very interested in any developments related to 
crash tolerance.  The details of interfaces are less important as long as 
user-level applications can with reasonable convenience and performance 
obtain a simple guarantee:  Following a power failure or other crash, a 
file can always be restored to a state that the application deemed 
consistent (application-level invariants & correctness criteria hold). 
Ideally the application would like a synchronous function call whose 
successful return provides the consistent-recoverability guarantee for the 
current state of the file.  That's the guarantee that the original 
failure-atomic msync() of EuroSys 2013 provided.

Obtaining this guarantee with ioctl(FICLONE) is quite convenient:  When 
the application knows that the file is in a consistent state, the 
application makes a clone and stashes the clone in a safe place.  Loosely 
speaking, the performance desired is that the work of cloning should be 
"O(delta) not O(data)", i.e., the time and effort required to make & stash 
a clone should be proportional to the amount of data in the file changed 
between consecutive clones, not to the logical size of the entire file. 
I gather from our recent correspondence that XFS cloning today requires 
O(data) time and effort, not O(delta).  Knowing that is progress; we now 
have a much better understanding of what's going on under the hood.
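
To make the clone-and-stash pattern concrete, here is a minimal sketch in 
Python, assuming Linux and a reflink-capable file system such as XFS.  The 
names checkpoint(), restore(), and stash_path are hypothetical, the numeric 
FICLONE fallback is the Linux value of the ioctl request, and the 
fsync/rename ordering is an assumption about what durability requires, not 
a statement of what XFS guarantees.

```python
import fcntl
import os

# FICLONE is _IOW(0x94, 9, int) in <linux/fs.h>.  Use the constant from the
# fcntl module if this Python provides one; otherwise fall back to the
# Linux value (assumption: we are running on Linux).
FICLONE = getattr(fcntl, "FICLONE", 0x40049409)


def _fsync_dir(path):
    """fsync the directory containing `path` so a rename in it persists."""
    fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def checkpoint(data_path, stash_path):
    """Clone data_path and atomically install the clone at stash_path.

    Hypothetical helper: call it only when the application considers the
    file consistent.  Raises OSError if the file system cannot reflink.
    """
    tmp_path = stash_path + ".tmp"
    src = os.open(data_path, os.O_RDONLY)
    try:
        dst = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            fcntl.ioctl(dst, FICLONE, src)  # share extents with the source
            os.fsync(dst)                   # flush the clone before renaming
        except OSError:
            os.unlink(tmp_path)             # leave no partial clone behind
            raise
        finally:
            os.close(dst)
    finally:
        os.close(src)
    os.replace(tmp_path, stash_path)        # atomically swap in the new clone
    _fsync_dir(stash_path)


def restore(stash_path, data_path):
    """After a crash, replace data_path with a fresh clone of the stash."""
    checkpoint(stash_path, data_path)
```

An application would call checkpoint() whenever its invariants hold and 
restore() during recovery.  The sketch only illustrates the interface; it 
says nothing about cost, which on XFS today is O(data) per clone.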

We understand that you're volunteers and that you're busy with many 
important matters.  We're not asking for any further work, though we'll 
surely applaud from the sidelines any improvements toward crash tolerance.

I've been thinking about alternative approaches to crash tolerance for 
over a decade.  In practice today people use things like relational 
databases and transactional key-value stores to protect application data 
integrity from crashes. I'm interested in other approaches, including but 
not limited to failure-atomic msync() and the moral equivalents thereof 
implemented with help from file systems.  I've worked on a half-dozen 
variants of this theme and I'd be happy to explain why I think this area 
is exciting to anyone willing to listen.  In a nutshell I look forward to 
the day when file systems render relational databases and transactional 
key-value stores obsolete for some (not all) use cases.

Thanks again for your extraordinary help clarifying matters, which goes 
above & beyond the call of duty, and happy holidays!

-- Terence

On Tue, 20 Dec 2022, Dave Chinner wrote:

> > I mainly want to emphasize that nobody is asking for the behavior of 
> > AdvFS in that FAST 2015 paper.
>
> OK, so what are you asking us to do, then?