On Wed, Jan 17, 2018 at 02:07:22PM -0800, Linus Torvalds wrote:
>
> Now re-do the test while another process writes to a totally unrelated
> a huge file (say, do a ISO file copy or something).
>
> That was the thing that several filesystems get completely and
> horribly wrong. Generally _particularly_ the logging filesystems that
> don't even need the fsync, because they use a single log for
> everything (so fsync serializes all the writes, not just the writes to
> the one file it's fsync'ing).

Well, let's be fair; this is something *ext3* got wrong, and it was the
default file system back then. All of the modern file systems now do
delayed allocation, which means that an fsync of one file doesn't
actually imply an fsync of another file. Hence...

> The original git design was very much to write each object file
> without any syncing, because they don't matter since a new object file
> - by definition - isn't really reachable. Then sync before writing the
> index file or a new ref.

This isn't really safe any more. Yes, there's a single log. But files
which are subject to delayed allocation are in the page cache, and just
because you fsync the index file doesn't mean that the object file is
now written to disk. It was true for ext3, but it's not true for ext4,
xfs, btrfs, etc.

The good news is that if you have another process downloading a huge
ISO image, the fsync of the index file won't force the ISO file to be
written out. The bad news is that it won't force out the other git
object files, either.

Now, there is a potential downside of fsync'ing each object file, and
that is that the cost of doing a CACHE FLUSH on an HDD is non-trivial,
and even on an SSD, it's not optimal to call CACHE FLUSH thousands of
times a second. So if you are creating thousands of tiny files, and you
fsync each one, each fsync(2) call is a serializing operation, which
means it won't return until that one file is written to disk. If you
are writing lots of small files on an HDD, you'll be bottlenecked to
around 30 files per second on a 5400 RPM HDD, and this is true
regardless of what file system you use, because the bottleneck is the
CACHE FLUSH operation, and how you organize the metadata and do the
block allocation is largely lost in the noise compared to the CACHE
FLUSH command, which serializes everything.

There are solutions to this; you could simply not call fsync(2) a
thousand times, and instead write a pack file, and call fsync once on
the pack file. That's probably the smartest approach. You could also
create a thousand threads, and call fsync(2) on those thousand threads
at roughly the same time. Or you could use a bleeding edge kernel with
the latest AIO patch, and use the newly added IOCB_CMD_FSYNC support.

But I'd simply recommend writing a pack and fsync'ing the pack, instead
of trying to write a gazillion object files. (git-repack -A, I'm
looking at you....)

					- Ted
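
A minimal sketch of the tradeoff described above, purely illustrative
(the file names, sizes, and counts are made up, and this is not git's
actual I/O path): fsync(2) after every tiny file forces a CACHE FLUSH
per file, while streaming the same data into a single pack-style file
and fsync'ing once pays that cost one time.

/* Illustrative sketch, not git code: compare fsync-per-file with a
 * single fsync on one large file.  File names are made up. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void die(const char *msg) { perror(msg); exit(1); }

static void write_and_fsync(const char *path, const void *buf, size_t len)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		die("open");
	if (write(fd, buf, len) != (ssize_t) len)
		die("write");
	if (fsync(fd) < 0)		/* one CACHE FLUSH per call */
		die("fsync");
	close(fd);
}

int main(void)
{
	char buf[4096], path[64];
	int i, fd;

	memset(buf, 'x', sizeof(buf));

	/* Slow on an HDD: roughly 30 of these per second, one flush each. */
	for (i = 0; i < 1000; i++) {
		snprintf(path, sizeof(path), "obj-%04d", i);
		write_and_fsync(path, buf, sizeof(buf));
	}

	/* Much cheaper: stream everything into one file, flush once. */
	fd = open("pack-all", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		die("open");
	for (i = 0; i < 1000; i++)
		if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
			die("write");
	if (fsync(fd) < 0)
		die("fsync");
	close(fd);
	return 0;
}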
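
And an equally rough sketch of the "many threads fsync'ing at once"
idea mentioned above (again illustrative only, with made-up file names;
build with -lpthread): when the fsync(2) calls are issued concurrently,
the file system has a chance to batch them into far fewer journal
commits and CACHE FLUSHes than issuing them one after another.

/* Illustrative sketch, not git code: fsync a set of already-written
 * files from concurrent threads so the flushes can be batched. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NFILES 16

static char paths[NFILES][32];

static void *fsync_worker(void *arg)
{
	const char *path = arg;
	int fd = open(path, O_WRONLY);

	if (fd >= 0) {
		fsync(fd);	/* many of these can share one flush */
		close(fd);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NFILES];
	int i;

	for (i = 0; i < NFILES; i++) {
		snprintf(paths[i], sizeof(paths[i]), "obj-%04d", i);
		pthread_create(&tid[i], NULL, fsync_worker, paths[i]);
	}
	for (i = 0; i < NFILES; i++)
		pthread_join(tid[i], NULL);
	return 0;
}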