Re: [TOPIC] Extending the filesystem crash recovery guaranties contract

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 9 May 2019 11:43:27 +1000

On Thu, May 02, 2019 at 10:30:43PM -0400, Theodore Ts'o wrote:
> On Thu, May 02, 2019 at 01:39:47PM -0400, Amir Goldstein wrote:
> > I am not saying there is no room for a document that elaborates on those
> > guaranties. I personally think that could be useful and certainly think that
> > your group's work for adding xfstest coverage for API guaranties is useful.
> 
> Again, here is my concern.  If we promise that ext4 will always obey
> Dave Chinner's SOMC model, it would forever rule out Daejun Park and
> Dongkun Shin's "iJournaling: Fine-grained journaling for improving the
> latency of fsync system call"[1] published in Usenix ATC 2017.

No, it doesn't rule that out at all.

In a SOMC model, incremental journalling is just fine when there are
no external dependencies on the thing being fsync'd.  If you have
other dependencies (e.g. the file has just been created and so the dir
is dirty, too) then fsync would need to do the whole shebang, but
otherwise....

> So if the crash consistency guarantees forbid future innovations
> where applications might *want* a fast fsync() that doesn't drag
> unrelated inodes into the persistence guarantees,

.... the whole point of SOMC is that it allows filesystems to avoid
dragging external metadata into fsync() operations /unless/ there's
a user visible ordering dependency that must be maintained between
objects.  If all you are doing is stabilising file data in a stable
file/directory, then independent, incremental journaling of the
fsync operations on that file fits the SOMC model just fine.
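
The rule above can be sketched as a toy decision function. This is only
an illustration of the reasoning, not how any real filesystem tracks
dependencies: the names Inode, dirty_deps and fsync_plan are
hypothetical, and the model assumes the filesystem already knows which
other objects carry uncommitted changes that the fsync target depends
on.

```python
from dataclasses import dataclass, field


@dataclass
class Inode:
    """Toy stand-in for an in-memory inode (hypothetical, for illustration)."""
    name: str
    # Names of other objects with uncommitted changes that must become
    # stable no later than this inode, e.g. the parent directory of a
    # file that has just been created.
    dirty_deps: set = field(default_factory=set)


def fsync_plan(inode):
    """Return (strategy, objects_to_persist) for an fsync of `inode`.

    Assumption of this sketch: with no external dependencies the fsync
    may be journalled incrementally and on its own; otherwise the
    dependencies are persisted together with the inode.
    """
    if inode.dirty_deps:
        return ("full_commit", sorted(inode.dirty_deps | {inode.name}))
    return ("incremental", [inode.name])
```

An overwrite of an existing file in a stable directory takes the
incremental path; a newly created file whose directory is still dirty
does not.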

> is that really what
> we want?  Do we want to forever rule out various academic
> investigations such as Park and Shin's because "it violates the crash
> consistency recovery model"?  Especially if some applications don't
> *need* the crash consistency model?

Stop with the silly inflammatory hyperbole already, Ted. It is not
necessary.

> P.P.S.  One of the other discussions that did happen during the main
> LSF/MM File system session, and for which there was general agreement
> across a number of major file system maintainers, was a fsync2()
> system call which would take a list of file descriptors (and flags)
> that should be fsync'ed.

Hmmmm, that wasn't on the agenda, and nobody has documented it as
yet.

> The semantics would be that when the
> fsync2() successfully returns, all of the guarantees of fsync() or
> fdatasync() requested by the list of file descriptors and flags would
> be satisfied.  This would allow file systems to more optimally fsync a
> batch of files, for example by implementing data integrity writebacks
> for all of the files, followed by a single journal commit to guarantee
> persistence for all of the metadata changes.

What happens when you get writeback errors on only some of the fds?
How do you report the failures and what do you do with the journal
commit on partial success?

Of course, this ignores the elephant in the room: applications can
/already do this/ using AIO_FSYNC and have individual error status
for each fd. Not to mention that filesystems already batch
concurrent fsync journal commits into a single operation. I'm not
seeing the point of a new syscall to do this right now....
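
A minimal sketch of the "batch of fsyncs with individual error status"
pattern follows. It does not use AIO_FSYNC itself, which the Python
standard library does not expose; it swaps in a thread pool of blocking
os.fsync() calls, which likewise puts several fsyncs in flight at once
and keeps a separate result for each fd. The helper names fsync_one and
fsync_batch are made up for this example.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def fsync_one(fd):
    """fsync a single descriptor; return 0 on success, else the errno."""
    try:
        os.fsync(fd)
        return 0
    except OSError as e:
        return e.errno


def fsync_batch(fds):
    """Issue fsync() on every fd concurrently.

    Returns a dict mapping each fd to 0 (success) or an errno value, so
    a failure on one descriptor does not hide the outcome of the others.
    """
    fds = list(fds)
    if not fds:
        return {}
    # One worker per fd so that all fsyncs are in flight together; this
    # is a stand-in for asynchronous submission, not the same mechanism.
    with ThreadPoolExecutor(max_workers=len(fds)) as pool:
        results = list(pool.map(fsync_one, fds))
    return dict(zip(fds, results))
```

Because each descriptor carries its own status, the caller can retry or
report the failed files individually, which is the partial-failure case
a single fsync2() return value would have to encode somehow.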

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx