On Fri, May 03, 2019 at 12:16:32AM -0400, Amir Goldstein wrote:
> OK. we can leave that one for later.
> Although I am not sure what the concern is.
> If we are able to agree and document a LINK_ATOMIC flag,
> what would be the down side of documenting a RENAME_ATOMIC
> flag with same semantics? After all, as I said, this is what many users
> already expect when renaming a temp file (as ext4 heuristics prove).

The problem is if the "temp file" has been hardlinked to 1000
different directories, does the rename() have to guarantee that the
changes to all 1000 directories have been persisted to disk? And that
all of the parent directories of those 1000 directories have also
*all* been persisted to disk, all the way up to the root?

With the O_TMPFILE linkat case, we know that the inode hasn't been
hard-linked to any other directory, and mercifully directories have
only one parent directory, so we only have to check that one set of
directory inodes, all the way up to the root, has been persisted.

But.... I can already imagine someone complaining that if, due to
bind mounts and 1000 mount namespaces, there is some *other* directory
pathname which could be used to reach said "tmpfile", we have to
guarantee that all parent directories which could be used to reach
said "tmpfile", even if they span a dozen different file systems,
*also* have to be persisted, due to sloppy drafting of what the
atomicity rules might happen to be.

If we are only guaranteeing the persistence of the containing
directories of the source and destination files, that's pretty easy.
But then the consistency rules need to *explicitly* state this. Some
of the handwaving definitions of what would be guaranteed.... scare
me.

						- Ted

P.S.
If we were going to do this, we'd probably want to simply define a
flag to be AT_FSYNC, using the strict POSIX definition of fsync, which
is to say, as a result of the linkat or renameat, the file in
question, and its associated metadata, are guaranteed to be persisted
to disk. No guarantees would be made about any other inode's
metadata, regardless of when those changes might have been made.

If people really want "global barrier" semantics, then perhaps it
would be better to simply define a barrierfs(2) system call that works
like syncfs(2) --- it applies to the whole file system, and guarantees
that all changes made *before* barrierfs(2) will be visible if any
changes made *after* barrierfs(2) are visible. Amir, you used "global
ordering" a few times; if you really need that, let's define a new
system call which guarantees that.

Maybe some of the research proposals for exotic changes to SSD
semantics, etc., would allow barrierfs(2) semantics to be something
that we could implement more efficiently than syncfs(2). But let's
make this be explicit, as opposed to some magic guarantee that falls
out as a side effect of the fsync(2) system call to a single inode.