Re: [RFC][PATCH] link.2: AT_ATOMIC_DATA and AT_ATOMIC_METADATA

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, May 28, 2019 at 11:07 PM Darrick J. Wong
<darrick.wong@xxxxxxxxxx> wrote:
>
> On Mon, May 27, 2019 at 08:26:55PM +0300, Amir Goldstein wrote:
> > New link flags to request "atomic" link.
> >
> > Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx>
> > ---
> >
> > Hi Guys,
> >
> > Following our discussions on LSF/MM and beyond [1][2], here is
> > an RFC documentation patch.
> >
> > Ted, I know we discussed limiting the API for linking an O_TMPFILE
> > to avert the hardlinks issue, but I decided it would be better to
> > document the hardlinks non-guaranty instead. This will allow me to
> > replicate the same semantics and documentation to renameat(2).
> > Let me know how that works out for you.
> >
> > I also decided to try out two separate flags for data and metadata.
> > I do not find any of those flags very useful without the other, but
> > documenting them seprately was easier, because of the fsync/fdatasync
> > reference.  In the end, we are trying to solve a social engineering
> > problem, so this is the least confusing way I could think of to describe
> > the new API.
> >
> > First implementation of AT_ATOMIC_METADATA is expected to be
> > noop for xfs/ext4 and probably fsync for btrfs.
> >
> > First implementation of AT_ATOMIC_DATA is expected to be
> > filemap_write_and_wait() for xfs/ext4 and probably fdatasync for btrfs.
> >
> > Thoughts?
> >
> > Amir.
> >
> > [1] https://lwn.net/Articles/789038/
> > [2] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjZm6E2TmCv8JOyQr7f-2VB0uFRy7XEp8HBHQmMdQg+6w@xxxxxxxxxxxxxx/
> >
> >  man2/link.2 | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 51 insertions(+)
> >
> > diff --git a/man2/link.2 b/man2/link.2
> > index 649ba00c7..15c24703e 100644
> > --- a/man2/link.2
> > +++ b/man2/link.2
> > @@ -184,6 +184,57 @@ See
> >  .BR openat (2)
> >  for an explanation of the need for
> >  .BR linkat ().
> > +.TP
> > +.BR AT_ATOMIC_METADATA " (since Linux 5.x)"
> > +By default, a link operation followed by a system crash, may result in the
> > +new file name being linked with old inode metadata, such as out dated time
> > +stamps or missing extended attributes.
> > +.BR fsync (2)
> > +before linking the inode, but that involves flushing of volatile disk caches.
> > +
> > +A filesystem that accepts this flag will guaranty, that old inode metadata
> > +will not be exposed in the new linked name.
> > +Some filesystems may internally perform
> > +.BR fsync (2)
> > +before linking the inode to provide this guaranty,
> > +but often, filesystems will have a more efficient method to provide this
> > +guaranty without flushing volatile disk caches.
> > +
> > +A filesystem that accepts this flag does
> > +.BR NOT
> > +guaranty that the new file name will exist after a system crash, nor that the
> > +current inode metadata is persisted to disk.
>
> Hmmm.  I think it would be much clearer to state the two expectations in
> the same place, e.g.:
>
> "A filesystem that accepts this flag guarantees that after a successful
> call completion, the filesystem will return either (a) the version of
> the metadata that was on disk at the time the call completed; (b) a
> newer version of that metadata; or (c) -ENOENT.  In other words, a
> subsequent access of the file path will never return metadata that was
> obsolete at the time that the call completed, even if the system crashes
> immediately afterwards."

Your feedback is along the same line as Ted's feedback.
I will try out the phrasing that Ted proposed, will see how that goes...

>
> Did I get that right?  I /think/ this means that one could implement Ye
> Olde Write And Rename as:
>
> fd = open(".", O_TMPFILE...);
> write(fd);
> fsync(fd);

Certainly not fsync(), that what my "counter-fsync()" phrasing was trying to
convey. The flags are meant as a "cheaper" replacement for that fsync().

> snprintf(path, PATH_MAX, "/proc/self/fd/%d", fd);
> linkat(AT_FDCWD, path, AT_FDCWD, "file.txt",
>         AT_EMPTY_PATH | AT_ATOMIC_DATA | AT_ATOMIC_METADATA);
>
> (Still struggling to figure out what userspace programs would use this
> for...)
>

There are generally two use cases:

1. Flushing volatile disk caches is very expensive.
In this case sync_file_range(SYNC_FILE_RANGE_WRITE_AND_WAIT)
is cheaper than fdatasync() for a preallocated file and obviously a noop
for AT_ATOMIC_METADATA in "S.O.M.C" filesystem is cheaper than
a journal commit.

2. Highly concurrent metadata workloads
Many users running  AT_ATOMIC_METADATA concurrently are much
less likely to interfere each other than many users running fsync concurrently.

IWO, when I'm using fsync() to provide the AT_ATOMIC_METADATA guarantee
on a single journal fs, other users pay the penalty for a guaranty that I didn't
ask for (i.e. persistency).

Thanks,
Amir.



[Index of Archives]     [Reiser Filesystem Development]     [Ceph FS]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite National Park]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Device Mapper]     [Linux Media]

  Powered by Linux