Re: [RFC][PATCH] link.2: AT_ATOMIC_DATA and AT_ATOMIC_METADATA

Amir Goldstein <amir73il@xxxxxxxxx> · Sat, 1 Jun 2019 10:21:49 +0300

On Sat, Jun 1, 2019 at 1:46 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Fri, May 31, 2019 at 12:41:36PM -0400, Theodore Ts'o wrote:
> > On Fri, May 31, 2019 at 06:21:45PM +0300, Amir Goldstein wrote:
> > > What do you think of:
> > >
> > > "AT_ATOMIC_DATA (since Linux 5.x)
> > > A filesystem which accepts this flag will guarantee that if the linked file
> > > name exists after a system crash, then all of the data written to the file
> > > and all of the file's metadata at the time of the linkat(2) call will be
> > > visible.
> >
> > ".... will be visible after the the file system is remounted".  (Never
> > hurts to be explicit.)
> >
> > > The way to achieve this guarantee on old kernels is to call fsync (2)
> > > before linking the file, but doing so will also results in flushing of
> > > volatile disk caches.
> > >
> > > A filesystem which accepts this flag does NOT
> > > guarantee that any of the file hardlinks will exist after a system crash,
> > > nor that the last observed value of st_nlink (see stat (2)) will persist."
> > >
> >
> > This is I think more precise:
> >
> >     This guarantee can be achieved by calling fsync(2) before linking
> >     the file, but there may be more performant ways to provide these
> >     semantics.  In particular, note that the use of the AT_ATOMIC_DATA
> >     flag does *not* guarantee that the new link created by linkat(2)
> >     will be persisted after a crash.
>
> So here's the *implementation* problem I see with this definition of
> AT_ATOMIC_DATA. After linkat(dirfd, name, AT_ATOMIC_DATA), there is
> no guarantee that the data is on disk or that the link is present.
>
> However:
>
>         linkat(dirfd, name, AT_ATOMIC_DATA);
>         fsync(dirfd);
>
> Suddenly changes all that.
>
> i.e. when we fsync(dirfd) we guarantee that "name" is present in the
> directory and because we used AT_ATOMIC_DATA it implies that the
> data pointed to by "name" must be present on disk. IOWs, what was
> once a pure directory sync operation now *must* fsync all the child
> inodes that have been linkat(AT_ATOMIC_DATA) since the last time the
> direct has been made stable.
>
> IOWs, the described AT_ATOMIC_DATA "we don't have to write the data
> during linkat() go-fast-get-out-of-gaol-free" behaviour isn't worth
> the pixels it is written on - it just moves all the complexity to
> directory fsync, and that's /already/ a behavioural minefield.

Where does it say we don't have to write the data during linkat()?
I was only talking about avoid FLUSH/FUA caused by fsync().
I wrote in commit message:
"First implementation of AT_ATOMIC_DATA is expected to be
filemap_write_and_wait() for xfs/ext4 and probably fdatasync for btrfs."

I failed to convey the reasoning for this flag to you.
It is *not* about the latency of the "atomic link" for the calling thread
It is about not interfering with other workloads running at the same time.

>
> IMO, the "work-around" of forcing filesystems to write back
> destination inodes during a link() operation is just nasty and will
> just end up with really weird performance anomalies occurring in
> production systems. That's not really a solution, either, especially
> as it is far, far faster for applications to use AIO_FSYNC and then
> on the completion callback run a normal linkat() operation...
>
> Hence, if I've understood these correctly, then I'll be recommending
> that XFS follows this:
>
> > We should also document that a file system which does not implement
> > this flag MUST return EINVAL if it is passed this flag to linkat(2).
>
> and returns -EINVAL to these flags because we do not have the change
> tracking infrastructure to handle these directory fsync semantics.
> I also suspect that, even if we could track this efficiently, we
> can't do the flushing atomically because of locking order
> constraints between directories, regular files, pages in the page
> cache, etc.

That is not at all what I had in mind for XFS with the flag.

>
> Given that we can already use AIO to provide this sort of ordering,
> and AIO is vastly faster than synchronous IO, I don't see any point
> in adding complex barrier interfaces that can be /easily implemented
> in userspace/ using existing AIO primitives. You should start
> thinking about expanding libaio with stuff like
> "link_after_fdatasync()" and suddenly the whole problem of
> filesystem data vs metadata ordering goes away because the
> application directly controls all ordering without blocking and
> doesn't need to care what the filesystem under it does....
>

OK. I can work with that. It's not that simple, but I will reply on
your next email, where you wrote more about this alternative.

Thanks,
Amir.