On Fri, Jan 31, 2020 at 10:33 PM Omar Sandoval <osandov@xxxxxxxxxxx> wrote: > > On Fri, Jan 31, 2020 at 09:00:47AM +0200, Amir Goldstein wrote: > > On Fri, Jan 31, 2020 at 7:27 AM Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote: > > > > > > On Sat, Jan 25, 2020 at 08:25:46AM +1100, Dave Chinner wrote: > > > > On Thu, Jan 23, 2020 at 09:47:30AM +0200, Amir Goldstein wrote: > > > > > On Thu, Jan 23, 2020 at 9:16 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > > > > > > > > > > > On Thu, Jan 23, 2020 at 03:47:45AM +0000, Al Viro wrote: > > > > > > > On Wed, Jan 22, 2020 at 02:10:03PM -0800, Omar Sandoval wrote: > > > > > > > > > > > > > > > > Sorry for not reading all the thread again, some API questions: > > > > > > > > > - We intend to allow AT_REPLACE only with O_TMPFILE src. Right? > > > > > > > > > > > > > > > > I wasn't planning on having that restriction. It's not too much effort > > > > > > > > for filesystems to support it for normal files, so I wouldn't want to > > > > > > > > place an artificial restriction on a useful primitive. > > > > > > > > > > > > > > > > > I have too many gray hairs each one for implementing a "useful primitive" > > > > > that nobody asked for and bare the consequences. > > > > > Your introduction to AT_REPLACE uses O_TMPFILE. > > > > > I see no other sane use of the interface. > > > > > > > > > > > > I'm not sure; that's how we ended up with the unspeakable APIs like > > > > > > > rename(2), after all... > > > > > > > > > > > > Yet it is just rename(2) with the serial numbers filed off - > > > > > > complete with all the same data vs metadata ordering problems that > > > > > > rename(2) comes along with. i.e. it needs fsync to guarantee data > > > > > > integrity of the source file before the linkat() call is made. > > > > > > > > > > > > If we can forsee that users are going to complain that > > > > > > linkat(AT_REPLACE) using O_TMPFILE files is not atomic because it > > > > > > leaves zero length files behind after a crash just like rename() > > > > > > does, then we haven't really improved anything at all... > > > > > > > > > > > > And, really, I don't think anyone wants another API that requires > > > > > > multiple fsync calls to use correctly for crash-safe file > > > > > > replacement, let alone try to teach people who still cant rename a > > > > > > file safely how to use it.... > > > > > > > > > > > > > > > > Are you suggesting that AT_LINK_REPLACE should have some of > > > > > the semantics I posted in this RFC for AT_ATOMIC_xxx: > > > > > https://lore.kernel.org/linux-fsdevel/20190527172655.9287-1-amir73il@xxxxxxxxx/ > > > > > > > > Not directly. > > > > > > > > All I'm pointing out is that data integrity guarantees of > > > > AT_LINK_REPLACE are yet another aspect of this new feature that > > > > has not yet been specified or documented at all. > > > > > > > > And in pointing this out, I'm making an observation that the > > > > rename(2) behaviour which everyone seems to be assuming this > > > > function will replicate is a terrible model to copy/reinvent. > > > > > > > > Addressing this problem is something for the people pushing for > > > > AT_LINK_REPLACE to solve, not me.... > > > > > > Or the grumpy maintainer who will have to digest all of this. > > > > > > Can we update the documentation to admit that many people will probably > > > want to use this (and rename) as atomic swap operations? > > > > > > "The filesystem will commit the data and metadata of all files and > > > directories involved in the link operation to stable storage before the > > > call returns." > > > > > > And finally add a flag: > > > > > > "AT_LINK_EATMYDATA: If specified, the filesystem may opt out of > > > committing anything to disk." > > > > > > > I agree with Christoph that this anomaly is not a good idea, but I also > > agree with you and Dave that if an operation is meant to be used for > > atomic swap, we should make it much harder for users to get it wrong. > > > > To that end, we can define both flags AT_LINK_REPLACE and > > AT_ATOMIC in uapi, but also define the combo > > AT_LINK_ATOMIC_REPLACE and let the documentation be > > very much focused on this usage. > > > > I would like to stress a point, which some who haven't followed [1] > > closely might have missed - the purpose of AT_ATOMIC is > > *not* to make the new link durable - it is *not* a replacement for fsync > > on parent dir. It's purpose is *only* to create a dependency between the > > durability of the new link and the new data it can expose. > > At the risk of spawning another bikeshed subthread, the naming could be > improved to make that more clear. AT_BARRIER? AT_DEPENDENCY? > It crossed my mind a few times, I may have even suggested it, but IMO this will not help anybody outside the kernel developers community at getting a better understanding of what the API means. > > AFAICT, this means that AT_ATOMIC would be implemented > > as filemap_write_and_wait() in xfs/ext4 and probably fdatasync in btrfs. > > > > Really, there is no chance that any user is interested in a non-atomic > > replace in that respect, so I am not even sure that we need an explicit > > flag for it. As it is, the AT_REPLACE API design would rank poorly as > > "The obvious use is wrong." > > > > An explicit AT_ATOMIC flag would help is that we could make the same > > semantics also apply to rename(2) with the same flag. > > > > Omar, > > > > I do understand why you say that you want to implement AT_REPLACE > > and let someone else take care of (bike shedding) AT_ATOMIC later. > > I'm not in much of a rush to get AT_LINK_REPLACE in, so I'd be okay > getting it in together with AT_ATOMIC. I don't think I have the > bandwidth to get yet another VFS interface in, though (since I still > have to get back to RWF_ENCODED, which is a higher priority for me). Are > you planning on picking up AT_ATOMIC any time soon? > I wasn't planning to. Mind you, there is no actual "work" to do with AT_ATOMIC - it is only about agreeing on the semantics - the implementation is trivial. The man page draft I have already posted, but as I said it would be much more valuable IMO to introduce AT_LINK_ATOMIC_REPLACE together to man page. I can help with that if needed. > > The problem with that approach is that man page will have the > > AT_REPLACE documentation spread out without the mention of > > AT_ATOMIC_REPLACE and that can generate damage long into the future. > > On the other hand, I believe I made this fairly clear in my man-page > patch: > > +This does not guarantee data integrity; > +see EXAMPLE below for how to use this for crash-safe file replacement with > +.BR O_TMPFILE . > > With the fsync() calls in the example. > Heh that is not how humans work ;-) People don't go reading the entire man page nor follow examples to the letter. Many get to AT_LINK_REPLACE description and stop reading. In that case, I prefer that they first get to AT_LINK_ATOMIC_REPLACE. I don't mind re-posting your patches adding the _ATOMIC_ keyword and implementation, but it's a very small change, so perhaps you'd rather do it yourself, or maybe Dave would want to carry this through if he is in a bigger rush. Thanks, Amir.