On Fri, May 31, 2019 at 12:41:36PM -0400, Theodore Ts'o wrote:
> On Fri, May 31, 2019 at 06:21:45PM +0300, Amir Goldstein wrote:
> > What do you think of:
> >
> > "AT_ATOMIC_DATA (since Linux 5.x)
> > A filesystem which accepts this flag will guarantee that if the linked file
> > name exists after a system crash, then all of the data written to the file
> > and all of the file's metadata at the time of the linkat(2) call will be
> > visible.
>
> ".... will be visible after the file system is remounted".  (Never
> hurts to be explicit.)
>
> > The way to achieve this guarantee on old kernels is to call fsync(2)
> > before linking the file, but doing so will also result in flushing of
> > volatile disk caches.
> >
> > A filesystem which accepts this flag does NOT
> > guarantee that any of the file's hardlinks will exist after a system crash,
> > nor that the last observed value of st_nlink (see stat(2)) will persist."
>
> This is, I think, more precise:
>
>    This guarantee can be achieved by calling fsync(2) before linking
>    the file, but there may be more performant ways to provide these
>    semantics.  In particular, note that the use of the AT_ATOMIC_DATA
>    flag does *not* guarantee that the new link created by linkat(2)
>    will be persisted after a crash.

So here's the *implementation* problem I see with this definition of
AT_ATOMIC_DATA. After linkat(dirfd, name, AT_ATOMIC_DATA), there is no
guarantee that the data is on disk or that the link is present. However:

	linkat(dirfd, name, AT_ATOMIC_DATA);
	fsync(dirfd);

suddenly changes all that. i.e. when we fsync(dirfd) we guarantee that
"name" is present in the directory, and because we used AT_ATOMIC_DATA
it implies that the data pointed to by "name" must be present on disk.

IOWs, what was once a pure directory sync operation now *must* fsync
all the child inodes that have been linked with linkat(AT_ATOMIC_DATA)
since the last time the directory was made stable.
IOWs, the described AT_ATOMIC_DATA "we don't have to write the data
during linkat()" go-fast-get-out-of-gaol-free behaviour isn't worth the
pixels it is written on - it just moves all the complexity to directory
fsync, and that's /already/ a behavioural minefield.

IMO, the "work-around" of forcing filesystems to write back destination
inodes during a link() operation is just nasty and will end up causing
really weird performance anomalies in production systems. That's not
really a solution, either, especially as it is far, far faster for
applications to use AIO_FSYNC and then run a normal linkat() operation
from the completion callback...

Hence, if I've understood these semantics correctly, I'll be
recommending that XFS follows this:

> We should also document that a file system which does not implement
> this flag MUST return EINVAL if it is passed this flag to linkat(2).

and returns -EINVAL for these flags, because we do not have the change
tracking infrastructure to handle these directory fsync semantics. I
also suspect that, even if we could track this efficiently, we couldn't
do the flushing atomically because of locking order constraints between
directories, regular files, pages in the page cache, etc.

Given that we can already use AIO to provide this sort of ordering, and
AIO is vastly faster than synchronous IO, I don't see any point in
adding complex barrier interfaces that can be /easily implemented in
userspace/ using existing AIO primitives. You should start thinking
about expanding libaio with something like "link_after_fdatasync()",
and suddenly the whole problem of filesystem data vs metadata ordering
goes away, because the application directly controls all ordering
without blocking and doesn't need to care what the filesystem under it
does....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx