Thanks all for your answers, they are really helpful. I now have a clearer
picture of how this works.

I have one last question. Suppose I’m in the process of creating 1000 files
this way, and the server crashes before syncfs() is called. What happens to
the files that were already passed to rename()/linkat()? Do they follow the
same ordering, so that I can be sure each one is either complete (all
data/xattr + XFS metadata) or absent from the destination directory? Or is
syncfs() the only way to ensure this ordering?

Thanks a lot for your time.

> On 17 Oct 2018, at 03:16, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Tue, Oct 16, 2018 at 10:22:18AM +0000, Romain Le Disez wrote:
>> Hi all,
>>
>> In this pseudo-code (extracted from OpenStack Swift [1]):
>>     fd=open("/tmp/tempfile", O_CREAT | O_WRONLY);
>>     write(fd, ...);
>>     fsetxattr(fd, ...);
>>     fsync(fd);
>>     rename("/tmp/tempfile", "/data/foobar");
>>     dirfd = open("/data", O_DIRECTORY | O_RDONLY);
>>     fsync(dirfd);
>>
>> OR (the same without temporary file):
>>     fd=open("/data", O_TMPFILE | O_WRONLY);
>>     write(fd, ...);
>>     fsetxattr(fd, ...);
>>     fsync(fd);
>>     linkat(AT_FDCWD, "/proc/self/fd/" + fd, AT_FDCWD, "/data/foobar", AT_SYMLINK_FOLLOW);
>
>     linkat(fd, "", AT_FDCWD, "/data/foobar", AT_EMPTY_PATH);
>
>>     dirfd = open("/data", O_DIRECTORY | O_RDONLY);
>>     fsync(dirfd);
>
>> I’m guaranteed that, whatever happens, I’ll have a
>> complete file (data+xattr) or no file at all in the directory
>> /data.
>
> Yes.
>
>> Second question, if I replace the two fsync() by one sync(), do I
>> get the same guarantee?
>>     fd=open("/data", O_TMPFILE | O_WRONLY);
>>     write(fd, ...);
>>     fsetxattr(fd, ...);
>>     linkat(AT_FDCWD, "/proc/self/fd/" + fd, AT_FDCWD, "/data/foobar", AT_SYMLINK_FOLLOW);
>>     sync();
>>
>> From what I understand of the FAQ [1], write_barrier guarantees
>> that the journal (aka log) will be written before the inode (aka
>> metadata). Did I miss something?
>
> "write barriers" don't exist anymore. What we have these days are
> cache flushes to correctly order data/metadata IO vs journal IO.
>
> The syncfs() operation (and sync(), which is just syncfs() across
> all filesystems) writes outstanding data first, then asks the
> filesystem to force metadata to stable storage. XFS does that with
> a log flush, which issues a cache flush (data now on stable storage)
> followed by FUA log writes (metadata now on stable storage in the
> journal).
>
> So, effectively, you get the same thing in both cases. The only
> difference is that sync() does a lot more work than a couple of
> fsync() operations, and does work system wide on filesystems and
> files you don't care about. fsync() will always perform better on a
> busy system than a sync call.
>
> Let the filesystem worry about optimising fsync calls necessary for
> consistency and integrity purposes. If there was a faster way than
> issuing fsync on only the objects that need it when required, then
> everyone would be using it all the time....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx

-- 
Romain
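
For reference, the first sequence quoted above (temporary file, fsync(),
rename(), then fsync() on the directory) could be written out roughly as
follows. This is only a sketch under a few assumptions: the helper name
publish_file(), the ".tmp" suffix and the 0644 mode are made up for
illustration and come from neither Swift nor this thread. The temporary file
is created inside the destination directory rather than /tmp, because
rename() cannot move a file across filesystems, and the fsetxattr() step is
left out to keep the sketch portable (it would go between the write and the
first fsync()).

```c
#define _POSIX_C_SOURCE 200809L  /* for fsync() and O_DIRECTORY in strict modes */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not from the thread): publish `len` bytes of `data`
 * as dir/name via temp file + fsync + rename + directory fsync.
 * Returns 0 on success, -1 on error. */
int publish_file(const char *dir, const char *name,
                 const void *data, size_t len)
{
    char tmp[4096], dst[4096];

    /* Build both paths; refuse names that would be truncated. */
    if (snprintf(tmp, sizeof tmp, "%s/%s.tmp", dir, name) >= (int)sizeof tmp ||
        snprintf(dst, sizeof dst, "%s/%s", dir, name) >= (int)sizeof dst)
        return -1;

    int fd = open(tmp, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write() may be partial, so loop until everything is written. */
    const char *p = data;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) { close(fd); unlink(tmp); return -1; }
        p += n;
        len -= (size_t)n;
    }

    /* Flush the file itself before it becomes visible under its final name. */
    if (fsync(fd) < 0) { close(fd); unlink(tmp); return -1; }
    if (close(fd) < 0) { unlink(tmp); return -1; }

    /* Atomically give the file its final name. */
    if (rename(tmp, dst) < 0) { unlink(tmp); return -1; }

    /* Flush the directory so the new entry is on stable storage too. */
    int dirfd = open(dir, O_DIRECTORY | O_RDONLY);
    if (dirfd < 0)
        return -1;
    int rc = fsync(dirfd);
    close(dirfd);
    return rc;
}
```

A call such as publish_file("/data", "foobar", buf, len) corresponds to the
seven-step sequence in the quoted pseudo-code. It issues two fsync() calls per
file, which is the pattern Dave recommends over a single system-wide sync().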