Re: Proposal for "proper" durable fsync() and fdatasync()

"Sachin Gaikwad" <sachin.kernel@xxxxxxxxx> · Mon, 24 Nov 2008 16:10:48 -0500

Hi Jamie,

On Tue, Feb 26, 2008 at 10:43 AM, Jamie Lokier <jamie@xxxxxxxxxxxxx> wrote:
> Ric Wheeler wrote:
>> >>I was surprised that fsync() doesn't do this already.  There was a lot
>> >>of effort put into block I/O write barriers during 2.5, so that
>> >>journalling filesystems can force correct write ordering, using disk
>> >>flush cache commands.
>> >>
>> >>After all that effort, I was very surprised to notice that Linux 2.6.x
>> >>doesn't use that capability to ensure fsync() flushes the disk cache
>> >>onto stable storage.
>> >
>> >It's surprising you are surprised, given that this [lame] fsync behavior
>> >has remained consistently lame throughout Linux's history.
>>
>> Maybe I am confused, but isn't this what fsync() does today whenever
>> barriers are enabled (the fsync() invalidates the drive's write cache)?
>
> No, fsync() doesn't always flush the drive's write cache.  It often
> does, and I think many people are under the impression it always does,
> but it doesn't.
>
> Try this code on ext3:
>
>        fd = open ("test_file", O_RDWR | O_CREAT | O_TRUNC, 0666);
>        while (1) {
>                char byte = 0;
>                usleep (100000);
>                pwrite (fd, &byte, 1, 0);
>                fsync (fd);
>        }
>
> It will do just over 10 write ops per second on an idle system (13 on
> mine), and 1 flush op per second.

How did you measure the write ops and flush ops? Is there a tool that
can be used for this? I tried looking at what CONFIG_BSD_PROCESS_ACCT
provides, but had no luck.

Sachin

>
> That's because ext3 fsync() only does a journal commit when the inode
> has changed.  The inode mtime is changed by write only with 1 second
> granularity.  Without a journal commit, there's no barrier, which
> translates to not flushing disk write cache.
>
> If you add "fchmod (fd, 0644); fchmod (fd, 0664);" between the write
> and fsync, you'll see at least 20 write ops and 20 flush ops per
> second, and you'll hear the disk seeking more.  That's because the
> fchmod dirties the inode, so fsync() writes the inode with a journal
> commit.
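
A minimal sketch of the two loops described above, for anyone who wants to
reproduce the comparison. The helper name fsync_loop() and its parameters
are made up for illustration and are not from Jamie's mail; the loop is
bounded and every call is checked, but otherwise it follows his snippet.
It does not count write or flush ops itself, so those still have to be
observed with a separate tool.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write one byte at offset 0 and fsync() it,
 * `iterations` times, sleeping delay_us microseconds before each write.
 * If dirty_inode is nonzero, toggle the file mode with fchmod() before
 * each fsync() so that the inode is dirty every time.
 * Returns 0 on success, -1 on the first failing call. */
int fsync_loop(const char *path, int iterations, int dirty_inode,
               useconds_t delay_us)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666);
    if (fd < 0)
        return -1;

    char byte = 'x';
    int rc = 0;
    for (int i = 0; i < iterations && rc == 0; i++) {
        usleep(delay_us);
        if (pwrite(fd, &byte, 1, 0) != 1)
            rc = -1;
        else if (dirty_inode &&
                 (fchmod(fd, 0644) != 0 || fchmod(fd, 0664) != 0))
            rc = -1;
        else if (fsync(fd) != 0)
            rc = -1;
    }
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

Calling fsync_loop("test_file", 100, 0, 100000) corresponds to the
original loop, and passing 1 as the third argument adds the fchmod() pair.
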
>
> It turns out even _that_ is not sufficient according to the kernel
> internals.  A journal commit uses an ordered request, which is not
> necessarily the same as a flush; it just happens to use a flush in this
> instance.  I'm not sure if ordered requests are actually implemented
> by any drivers at the moment.  If not now, they will be one day.
>
> We could change ext3 fsync() to always do a journal commit, and depend
> on the non-existence of block drivers which do ordered (not flush)
> barrier requests.  But there's lots of things wrong with that.  Not
> least, it hurts performance a lot for database-like applications and
> virtual machines, due to unnecessary seeks.  That way lies wrongness.
>
> Rightness is to make fdatasync() work well, with a genuine flush (or
> equivalent, see FUA) only when required, and not a mere ordered
> barrier, with no inode write, and to make sync_file_range()[*] offer
> the fancier applications finer controls which reflect what they
> actually need.
>
> [*] - or whatever.
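
From the application side, the database-style pattern being discussed
(overwrite in place, then ask only for the data to be made durable) might
look roughly like the sketch below. The helper name overwrite_durably() is
hypothetical. Whether the fdatasync() call actually reaches stable storage
depends on the kernel, filesystem and drive, which is exactly the point
under discussion, so treat this as an illustration of the call sequence
only. sync_file_range() is not shown.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes from buf at the given offset of
 * fd, retrying on short writes, then call fdatasync() so that the file
 * data (but not necessarily unrelated inode fields such as mtime) is
 * synced.  Returns 0 on success, -1 on error. */
int overwrite_durably(int fd, const void *buf, size_t len, off_t offset)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, offset);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
        offset += n;
    }
    return fdatasync(fd);
}
```
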
>
> -- Jamie
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>