Hi Jamie, On Tue, Feb 26, 2008 at 10:43 AM, Jamie Lokier <jamie@xxxxxxxxxxxxx> wrote: > Ric Wheeler wrote: >> >>I was surprised that fsync() doesn't do this already. There was a lot >> >>of effort put into block I/O write barriers during 2.5, so that >> >>journalling filesystems can force correct write ordering, using disk >> >>flush cache commands. >> >> >> >>After all that effort, I was very surprised to notice that Linux 2.6.x >> >>doesn't use that capability to ensure fsync() flushes the disk cache >> >>onto stable storage. >> > >> >It's surprising you are surprised, given that this [lame] fsync behavior >> >has remaining consistently lame throughout Linux's history. >> >> Maybe I am confused, but isn't this is what fsync() does today whenever >> barriers are enabled (the fsync() invalidates the drive's write cache). > > No, fsync() doesn't always flush the drive's write cache. It often > does, any I think many people are under the impression it always does, > but it doesn't. > > Try this code on ext3: > > fd = open ("test_file", O_RDWR | O_CREAT | O_TRUNC, 0666); > while (1) { > char byte; > usleep (100000); > pwrite (fd, &byte, 1, 0); > fsync (fd); > } > > It will do just over 10 write ops per second on an idle system (13 on > mine), and 1 flush op per second. How did you measure write-ops and flush-ops ? Is there any tool which can be used ? I tried looking at what CONFIG_BSD_PROCESS_ACCT provides, but no luck. Sachin > > That's because ext3 fsync() only does a journal commit when the inode > has changed. The inode mtime is changed by write only with 1 second > granularity. Without a journal commit, there's no barrier, which > translates to not flushing disk write cache. > > If you add "fchmod (fd, 0644); fchmod (fd, 0664);" between the write > and fsync, you'll see at least 20 write ops and 20 flush ops per > second, and you'll here the disk seeking more. That's because the > fchmod dirties the inode, so fsync() writes the inode with a journal > commit. > > It turns out even _that_ is not sufficient according to the kernel > internals. A journal commit uses an ordered request, which isn't the > same as a flush potentially, it just happens to use flush in this > instance. I'm not sure if ordered requests are actually implemented > by any drivers at the moment. If not now, they will be one day. > > We could change ext3 fsync() to always do a journal commit, and depend > on the non-existence of block drivers which do ordered (not flush) > barrier requests. But there's lots of things wrong with that. Not > least, it sucks performance for database-like applications and virtual > machines, a lot due to unnecessary seeks. That way lies wrongness. > > Rightness is to make fdatasync() work well, with a genuine flush (or > equivalent (see FUA), only when required, and not a mere ordered > barrier), no inode write, and to make sync_file_range()[*] offer the > fancier applications finer controls which reflect what they actually > need. > > [*] - or whatever. > > -- Jamie > - > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html