Re: [TOPIC] Extending the filesystem crash recovery guarantees contract

> Again, here is my concern.  If we promise that ext4 will always obey
> Dave Chinner's SOMC model, it would forever rule out Daejun Park and
> Dongkun Shin's "iJournaling: Fine-grained journaling for improving the
> latency of fsync system call"[1] published in Usenix ATC 2017.
>
> [1] https://www.usenix.org/system/files/conference/atc17/atc17-park.pdf
>
> That's because this provides a fast fsync() using an incremental
> journal.  This fast fsync would cause the metadata associated with the
> inode being fsync'ed to be persisted after the crash --- ahead of
> metadata changes to other, potentially completely unrelated files,
> which would *not* be persisted after the crash.  Fine-grained
> journalling would provide all of the POSIX guarantees, and
> applications that only care about the single file being fsync'ed
> would be happy.  BUT, it violates the proposed crash
> consistency guarantees.
>
> So if the crash consistency guarantees forbid future innovations
> where applications might *want* a fast fsync() that doesn't drag
> unrelated inodes into the persistence guarantees, is that really what
> we want?  Do we want to forever rule out various academic
> investigations such as Park and Shin's because "it violates the crash
> consistency recovery model"?  Especially if some applications don't
> *need* the crash consistency model?
>
>                                                 - Ted
>
> P.S.  I feel especially strong about this because I'm working with an
> engineer currently trying to implement a simplified version of Park
> and Shin's proposal...  So this is not a hypothetical concern of mine.
> I'd much rather not invalidate all of this engineer's work to date,
> especially since there is a published paper demonstrating that for
> some workloads (such as sqlite), this approach can be a big win.

Ted, I sympathize with your position. To be clear, this is not what my
group or Amir is suggesting we do.

A few things to clarify:
1) We are not suggesting that all file systems follow SOMC semantics.
If ext4 does not want to do so, we are quite happy to document that
ext4 provides a different set of reasonable semantics. We can make the
ext4-related documentation as minimal as you want (or drop ext4 from
the documentation entirely). I'm hoping this will satisfy you.
2) As I understand it, SOMC does not rule out the scenario in your
example, because it does not require fsync to push unrelated files to
storage.
3) We are not documenting how fsync works internally, merely what the
user-visible behavior is. I think this will actually free up file
systems to optimize fsync aggressively while making sure they provide
the required user-visible behavior.
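To make point 2 concrete, here is a minimal sketch of the contract
being discussed, assuming ordinary POSIX fsync semantics (the file
names "important.db" and "unrelated.log" are hypothetical):

```python
# Sketch: fsync() of one file does not have to persist an unrelated one.
import os
import tempfile

workdir = tempfile.mkdtemp()

# Modify two unrelated files.
a = os.open(os.path.join(workdir, "important.db"),
            os.O_CREAT | os.O_WRONLY, 0o644)
os.write(a, b"committed record\n")

b = os.open(os.path.join(workdir, "unrelated.log"),
            os.O_CREAT | os.O_WRONLY, 0o644)
os.write(b, b"scratch data\n")

# fsync() only the first file: after a crash, its data and metadata
# must be durable.  Neither POSIX nor the SOMC model requires this call
# to also persist the second file -- which is why a fast, fine-grained
# fsync implementation is compatible with the ordering model, as long
# as whatever it does persist respects the metadata ordering.
os.fsync(a)

os.close(a)
os.close(b)
```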

Quoting from Dave Chinner's response when you brought up this concern
previously (https://patchwork.kernel.org/patch/10849903/#22538743):

"Sure, but again this is orthogonal to what we are discussing here:
the user visible ordering of metadata operations after a crash.

If anyone implements a multi-segment or per-inode journal (say, like
NOVA), then it is up to that implementation to maintain the ordering
guarantees that a SOMC model requires. You can implement whatever
fsync() go-fast bits you want, as long as it provides the ordering
behaviour guarantees that the model defines.

IOWs, Ted, I think you have the wrong end of the stick here. This
isn't about optimising fsync() to provide better performance, it's
about guaranteeing order so that fsync() is not necessary and we
improve performance by allowing applications to omit order-only
synchronisation points in their workloads.

i.e. an order-based integrity model /reduces/ the need for a
hyper-optimised fsync operation because applications won't need to
use it as often."
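The "order-only synchronisation" pattern Dave describes can be
sketched as the classic atomic-replace-via-rename idiom, with no fsync
at all (the file name "config.json" is hypothetical):

```python
# Sketch: atomically replace a file without any fsync, relying on
# ordering rather than explicit durability.  Under an SOMC-style model,
# after a crash a reader sees either the wholly old or the wholly new
# file, because the rename is ordered after the writes that produced
# the temporary file.
import os
import tempfile

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "config.json")

with open(target, "w") as f:
    f.write('{"version": 1}')       # the old contents

tmp = target + ".tmp"
with open(tmp, "w") as f:
    f.write('{"version": 2}')       # the replacement contents
    # Note: no os.fsync() here.  The application is not asking for
    # durability, only for old-or-new atomicity across a crash.

os.rename(tmp, target)              # atomic within one filesystem
```

The point is that with ordering guaranteed, the application needs
fsync only when it needs durability, not merely consistency.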

> P.P.S.  One of the other discussions that did happen during the main
> LSF/MM File system session, and for which there was general agreement
> across a number of major file system maintainers, was a fsync2()
> system call which would take a list of file descriptors (and flags)
> that should be fsync'ed.  The semantics would be that when the
> fsync2() successfully returns, all of the guarantees of fsync() or
> fdatasync() requested by the list of file descriptors and flags would
> be satisfied.  This would allow file systems to more optimally fsync a
> batch of files, for example by implementing data integrity writebacks
> for all of the files, followed by a single journal commit to guarantee
> persistence for all of the metadata changes.

I like this "group fsync" idea. I think this is a great way to extend
the basic fsync interface.
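For illustration, a user-space sketch of what the proposed semantics
would look like. fsync2() does not exist as a syscall, and the
FSYNC_DATA_ONLY flag below is purely hypothetical; this emulation
simply loops, whereas a real in-kernel implementation could share a
single journal commit across all of the files:

```python
# Sketch: emulate the proposed fsync2() batch interface in user space.
import os
import tempfile

FSYNC_DATA_ONLY = 1  # hypothetical flag, mirroring fdatasync()

def fsync2(fds_and_flags):
    """Persist every (fd, flags) pair; on success (return 0), all of
    the fsync()/fdatasync() guarantees requested by the list hold."""
    for fd, flags in fds_and_flags:
        if flags & FSYNC_DATA_ONLY:
            # fdatasync() may be missing on some platforms.
            getattr(os, "fdatasync", os.fsync)(fd)
        else:
            os.fsync(fd)
    return 0

# Usage: flush two files with one logical call.
d = tempfile.mkdtemp()
f1 = os.open(os.path.join(d, "a"), os.O_CREAT | os.O_WRONLY, 0o644)
f2 = os.open(os.path.join(d, "b"), os.O_CREAT | os.O_WRONLY, 0o644)
os.write(f1, b"x")
os.write(f2, b"y")
ret = fsync2([(f1, 0), (f2, FSYNC_DATA_ONLY)])
os.close(f1)
os.close(f2)
```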
