Re: [TOPIC] Extending the filesystem crash recovery guarantees contract

On Wed, May 08, 2019 at 10:20:13PM -0400, Theodore Ts'o wrote:
> On Thu, May 09, 2019 at 11:43:27AM +1000, Dave Chinner wrote:
> > 
> > .... the whole point of SOMC is that it allows filesystems to avoid
> > dragging external metadata into fsync() operations /unless/ there's
> > a user visible ordering dependency that must be maintained between
> > objects.  If all you are doing is stabilising file data in a stable
> > file/directory, then independent, incremental journaling of the
> > fsync operations on that file fit the SOMC model just fine.
> 
> Well, that's not what Vijay's crash consistency guarantees state.  It
> guarantees quite a bit more than what you've written above.  Which is
> my concern.

SOMC does not define crash consistency rules - it defines change
dependencies and how ordering and atomicity affect the dependency
graph. How other people have interpreted that is out of my control.

> It came up as suggested alternative during Ric Wheeler's "Async all
> the things" session.  The problem he was trying to address was
> programs (perhaps userspace file servers) who need to fsync a large
> number of files at the same time.  The problem with his suggested
> solution (which we have for AIO and io_uring already) of having the
> program issue a large number of asynchronous fsync's and then waiting
> for them all, is that the back-end interface is a work queue, so there
> is a lot of effective serialization that takes place.

We got linear scaling out to device bandwidth and/or IOPS limits
with bulk fsync benchmarks on XFS with that simple workqueue
implementation.

If there are problems, then I'd suggest that people report bugs to
the developers of the AIO_FSYNC code (i.e. Christoph and myself) or
provide patches to improve it so these problems go away.

A new syscall with essentially the same user interface doesn't
guarantee that these implementation problems will be solved.
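
To be clear about what applications can already do today: here's a
rough sketch of throwing a pile of fsyncs at io_uring and reaping
per-fd results. Untested, assumes liburing is available and that nr
fits in one SQ ring, error handling mostly elided:

/*
 * Sketch only: batch fsync of many fds via io_uring, with each
 * completion carrying its own error status.
 */
#include <liburing.h>
#include <stdio.h>

static int batch_fsync(int *fds, unsigned int nr)
{
        struct io_uring ring;
        struct io_uring_cqe *cqe;
        unsigned int i;
        int ret;

        ret = io_uring_queue_init(nr, &ring, 0);
        if (ret < 0)
                return ret;

        /* queue one IORING_OP_FSYNC per fd */
        for (i = 0; i < nr; i++) {
                struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

                io_uring_prep_fsync(sqe, fds[i], 0);
                io_uring_sqe_set_data(sqe, (void *)(long)fds[i]);
        }
        io_uring_submit(&ring);

        /* reap completions - each one reports its own result */
        for (i = 0; i < nr; i++) {
                io_uring_wait_cqe(&ring, &cqe);
                if (cqe->res < 0)
                        fprintf(stderr, "fsync of fd %ld failed: %d\n",
                                (long)io_uring_cqe_get_data(cqe), cqe->res);
                io_uring_cqe_seen(&ring, cqe);
        }
        io_uring_queue_exit(&ring);
        return 0;
}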


> > > The semantics would be that when the
> > > fsync2() successfully returns, all of the guarantees of fsync() or
> > > fdatasync() requested by the list of file descriptors and flags would
> > > be satisfied.  This would allow file systems to more optimally fsync a
> > > batch of files, for example by implementing data integrity writebacks
> > > for all of the files, followed by a single journal commit to guarantee
> > > persistence for all of the metadata changes.
> > 
> > What happens when you get writeback errors on only some of the fds?
> > How do you report the failures and what do you do with the journal
> > commit on partial success?
> 
> Well, one approach would be to pass back the errors in the structure.
> Say something like this:
> 
>      int fsync2(int len, struct fsync_req[]);
> 
>      struct fsync_req {
>           int   fd;        /* IN */
>           int   flags;     /* IN */
>           int   retval;    /* OUT */
>      };

So it's essentially identical to the AIO_FSYNC interface, except
that it is synchronous.
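
Just to illustrate the shape of it, a caller of that proposed
fsync2() might look like the sketch below. The struct and prototype
are lifted straight from your mail; the rest is made up, since the
syscall doesn't exist anywhere:

/* Purely illustrative - fsync2() is only a proposal. */
struct fsync_req {
        int     fd;        /* IN */
        int     flags;     /* IN */
        int     retval;    /* OUT */
};

int fsync2(int len, struct fsync_req reqs[]);   /* proposed, not real */

static int flush_files(int *fds, int nr)
{
        struct fsync_req reqs[nr];
        int i, err = 0;

        for (i = 0; i < nr; i++) {
                reqs[i].fd = fds[i];
                reqs[i].flags = 0;
        }
        if (fsync2(nr, reqs) < 0)
                return -1;

        /* per-fd status comes back in retval, so a writeback error
         * on one file doesn't get lost behind the others */
        for (i = 0; i < nr; i++)
                if (reqs[i].retval < 0)
                        err = reqs[i].retval;
        return err;
}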

> As far as what to do with the journal commit on partial success goes,
> there are no atomic, "all or nothing" guarantees with this interface.
> It is implementation specific whether there would be one or more file
> system commits necessary before fsync2 returned.

IOWs, same guarantees as AIO_FSYNC.
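
For comparison, here's roughly the same thing through the existing
libaio/AIO_FSYNC path, with per-fd error status coming back in the
completion events. Again just an untested sketch with error handling
elided:

/* Sketch: batch fsync via the existing AIO_FSYNC path (libaio). */
#include <libaio.h>
#include <stdio.h>

static int batch_fsync_aio(int *fds, int nr)
{
        io_context_t ctx = 0;
        struct iocb iocbs[nr], *iocbps[nr];
        struct io_event events[nr];
        int i, done;

        if (io_setup(nr, &ctx) < 0)
                return -1;

        /* one IOCB_CMD_FSYNC per fd */
        for (i = 0; i < nr; i++) {
                io_prep_fsync(&iocbs[i], fds[i]);
                iocbps[i] = &iocbs[i];
        }
        io_submit(ctx, nr, iocbps);

        /* each event reports the result for its own fd */
        done = io_getevents(ctx, nr, nr, events, NULL);
        for (i = 0; i < done; i++) {
                if ((long)events[i].res < 0)
                        fprintf(stderr, "fsync of fd %d failed: %ld\n",
                                events[i].obj->aio_fildes,
                                (long)events[i].res);
        }
        io_destroy(ctx);
        return 0;
}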

> > Of course, this ignores the elephant in the room: applications can
> > /already do this/ using AIO_FSYNC and have individual error status
> > for each fd. Not to mention that filesystems already batch
> > concurrent fsync journal commits into a single operation. I'm not
> > seeing the point of a new syscall to do this right now....
> 
> But it doesn't work very well, because the implementation uses a
> workqueue.

Then fix the fucking implementation!

Sheesh! Did LSFMM include a free lobotomy for participants, or
something?

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx


