Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Jan 31, 2020 at 10:33 PM Omar Sandoval <osandov@xxxxxxxxxxx> wrote:
>
> On Fri, Jan 31, 2020 at 09:00:47AM +0200, Amir Goldstein wrote:
> > On Fri, Jan 31, 2020 at 7:27 AM Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote:
> > >
> > > On Sat, Jan 25, 2020 at 08:25:46AM +1100, Dave Chinner wrote:
> > > > On Thu, Jan 23, 2020 at 09:47:30AM +0200, Amir Goldstein wrote:
> > > > > On Thu, Jan 23, 2020 at 9:16 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > > >
> > > > > > On Thu, Jan 23, 2020 at 03:47:45AM +0000, Al Viro wrote:
> > > > > > > On Wed, Jan 22, 2020 at 02:10:03PM -0800, Omar Sandoval wrote:
> > > > > > >
> > > > > > > > > Sorry for not reading all the thread again, some API questions:
> > > > > > > > > - We intend to allow AT_REPLACE only with O_TMPFILE src. Right?
> > > > > > > >
> > > > > > > > I wasn't planning on having that restriction. It's not too much effort
> > > > > > > > for filesystems to support it for normal files, so I wouldn't want to
> > > > > > > > place an artificial restriction on a useful primitive.
> > > > > > >
> > > > >
> > > > > I have too many gray hairs each one for implementing a "useful primitive"
> > > > > that nobody asked for and bare the consequences.
> > > > > Your introduction to AT_REPLACE uses O_TMPFILE.
> > > > > I see no other sane use of the interface.
> > > > >
> > > > > > > I'm not sure; that's how we ended up with the unspeakable APIs like
> > > > > > > rename(2), after all...
> > > > > >
> > > > > > Yet it is just rename(2) with the serial numbers filed off -
> > > > > > complete with all the same data vs metadata ordering problems that
> > > > > > rename(2) comes along with. i.e. it needs fsync to guarantee data
> > > > > > integrity of the source file before the linkat() call is made.
> > > > > >
> > > > > > If we can forsee that users are going to complain that
> > > > > > linkat(AT_REPLACE) using O_TMPFILE files is not atomic because it
> > > > > > leaves zero length files behind after a crash just like rename()
> > > > > > does, then we haven't really improved anything at all...
> > > > > >
> > > > > > And, really, I don't think anyone wants another API that requires
> > > > > > multiple fsync calls to use correctly for crash-safe file
> > > > > > replacement, let alone try to teach people who still cant rename a
> > > > > > file safely how to use it....
> > > > > >
> > > > >
> > > > > Are you suggesting that AT_LINK_REPLACE should have some of
> > > > > the semantics I posted in this RFC  for AT_ATOMIC_xxx:
> > > > > https://lore.kernel.org/linux-fsdevel/20190527172655.9287-1-amir73il@xxxxxxxxx/
> > > >
> > > > Not directly.
> > > >
> > > > All I'm pointing out is that data integrity guarantees of
> > > > AT_LINK_REPLACE are yet another aspect of this new feature that
> > > > has not yet been specified or documented at all.
> > > >
> > > > And in pointing this out, I'm making an observation that the
> > > > rename(2) behaviour which everyone seems to be assuming this
> > > > function will replicate is a terrible model to copy/reinvent.
> > > >
> > > > Addressing this problem is something for the people pushing for
> > > > AT_LINK_REPLACE to solve, not me....
> > >
> > > Or the grumpy maintainer who will have to digest all of this.
> > >
> > > Can we update the documentation to admit that many people will probably
> > > want to use this (and rename) as atomic swap operations?
> > >
> > > "The filesystem will commit the data and metadata of all files and
> > > directories involved in the link operation to stable storage before the
> > > call returns."
> > >
> > > And finally add a flag:
> > >
> > > "AT_LINK_EATMYDATA: If specified, the filesystem may opt out of
> > > committing anything to disk."
> > >
> >
> > I agree with Christoph that this anomaly is not a good idea, but I also
> > agree with you and Dave that if an operation is meant to be used for
> > atomic swap, we should make it much harder for users to get it wrong.
> >
> > To that end, we can define both flags AT_LINK_REPLACE and
> > AT_ATOMIC in uapi, but also define the combo
> > AT_LINK_ATOMIC_REPLACE and let the documentation be
> > very much focused on this usage.
> >
> > I would like to stress a point, which some who haven't followed [1]
> > closely might have missed - the purpose of AT_ATOMIC is
> > *not* to make the new link durable - it is *not* a replacement for fsync
> > on parent dir. It's purpose is *only* to create a dependency between the
> > durability of the new link and the new data it can expose.
>
> At the risk of spawning another bikeshed subthread, the naming could be
> improved to make that more clear. AT_BARRIER? AT_DEPENDENCY?
>

It crossed my mind a few times, I may have even suggested it, but IMO
this will not help anybody outside the kernel developers community at
getting a better understanding of what the API means.

> > AFAICT, this means that AT_ATOMIC would be implemented
> > as filemap_write_and_wait() in xfs/ext4 and probably fdatasync in btrfs.
> >
> > Really, there is no chance that any user is interested in a non-atomic
> > replace in that respect, so I am not even sure that we need an explicit
> > flag for it. As it is, the AT_REPLACE API design would rank poorly as
> > "The obvious use is wrong."
> >
> > An explicit AT_ATOMIC flag would help is that we could make the same
> > semantics also apply to rename(2) with the same flag.
> >
> > Omar,
> >
> > I do understand why you say that you want to implement AT_REPLACE
> > and let someone else take care of (bike shedding) AT_ATOMIC later.
>
> I'm not in much of a rush to get AT_LINK_REPLACE in, so I'd be okay
> getting it in together with AT_ATOMIC. I don't think I have the
> bandwidth to get yet another VFS interface in, though (since I still
> have to get back to RWF_ENCODED, which is a higher priority for me). Are
> you planning on picking up AT_ATOMIC any time soon?
>

I wasn't planning to. Mind you, there is no actual "work" to do with
AT_ATOMIC - it is only about agreeing on the semantics - the implementation
is trivial. The man page draft I have already posted, but as I said it would
be much more valuable IMO to introduce AT_LINK_ATOMIC_REPLACE
together to man page. I can help with that if needed.

> > The problem with that approach is that man page will have the
> > AT_REPLACE documentation spread out without the mention of
> > AT_ATOMIC_REPLACE and that can generate damage long into the future.
>
> On the other hand, I believe I made this fairly clear in my man-page
> patch:
>
> +This does not guarantee data integrity;
> +see EXAMPLE below for how to use this for crash-safe file replacement with
> +.BR O_TMPFILE .
>
> With the fsync() calls in the example.
>

Heh that is not how humans work ;-)
People don't go reading the entire man page nor follow examples to the letter.
Many get to AT_LINK_REPLACE description and stop reading.
In that case, I prefer that they first get to AT_LINK_ATOMIC_REPLACE.

I don't mind re-posting your patches adding the _ATOMIC_ keyword
and implementation, but it's a very small change, so perhaps you'd
rather do it yourself, or maybe Dave would want to carry this through
if he is in a bigger rush.

Thanks,
Amir.



[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux