Re: [PATCH] direct I/O fallback sync simplification

Jamie Lokier <jamie@xxxxxxxxxxxxx> · Tue, 29 Sep 2009 22:30:02 +0100

Christoph Hellwig wrote:
> On Wed, Sep 23, 2009 at 03:04:30PM +0100, Jamie Lokier wrote:
> > Christoph Hellwig wrote:
> > > In the case of direct I/O falling back to buffered I/O we sync data
> > > twice currently: once at the end of generic_file_buffered_write using
> > > filemap_write_and_wait_range and once a little later in
> > > __generic_file_aio_write using do_sync_mapping_range with all flags set.
> > > 
> > > The wait before write of the do_sync_mapping_range call does not make
> > > any sense, so just keep the filemap_write_and_wait_range call and move
> > > it to the right spot.
> > 
> > Are you sure this is an expectation of O_DIRECT?
> > 
> > A few notes from the net, including some documentation from IBM,
> > advise using O_DIRECT|O_DSYNC if you need sync when direct I/O falls
> > back to buffered on some other OSes.
> 
> Not sure if we're on the same page here.  Before the patch we do the
> following in case of a fallback buffered write when opened with
> O_DIRECT:
> 
>  - a filemap_write_and_wait_range in generic_file_buffered_write for
>    (with a too long length, btw)
>  - a do_sync_mapping_range(wait,write,wait) from
>    __generic_file_aio_write
> 
> which expands to
> 
>  (1) __filemap_fdatawrite_range(SYNC_ALL)
>  (2) wait_on_page_writeback_range
>  (3) wait_on_page_writeback_range
>  (4) __filemap_fdatawrite_range(SYNC_ALL)
>  (5) wait_on_page_writeback_range
> 
> which clearly doesn't make any sense.  The patch reduces it to
>  
>  (1) __filemap_fdatawrite_range(SYNC_ALL)
>  (2) wait_on_page_writeback_range
> 
> which does exactly the same, that is assuring the _data_ is on disk.
> That's still not O_DSNC or O_SYNC semantics, but that's an entirely
> different discussion.

I agree that the patch is a great improvement, and it's obviously good
regardless of further discussions.

What I'm suggesting is that there is no need to commit the data to the
disk, and sometimes it's an unwanted pessimisation.  So those calls
may be removed entirely.

I'll describe the O_DIRECT, non-O_DSYNC case only.  (O_DIRECT|O_DSYNC
should of course do fdatasync_range properly after every write).

Apps which need the data to reach the disk with any assurance that
they'll be able to read it after a crash _must_ use fdatasync (or
O_DSYNC) to get that assurance.

Your explanation makes that even clearer: Historically what Linux has
done is not O_DSYNC, and this is not intended to change.  Because the
fallback happens when extending a file or filling a hole, it's even
more likely that the metadata needed to retrieve the data won't be
committed when the write returns.

Regarding performance, when the fallback happens, since good apps
should be using fdatasync (or O_DSYNC) already, and bad apps already
don't have any _useful_ guarantee, it's preferable that the buffered
writes have normal buffering behaviour and can be sensibly reordered
and batched like normal writes.  This is particularly relevant to the
common case of extending a file, and when writing to a filesystem
which doesn't support O_DIRECT at all.

In a nutshell, I'm saying the only useful behaviours for direct I/O
fallback are "like normal writes" or "like O_DSYNC" - that "sort of
kind of in between" doesn't help anything.  It's a historical,
unnecessary relic which should go to the same recycling bin as O_SYNC
meaning O_DSYNC on Linux (and probably the change should go alongside
your sensible fixes to O_SYNC/O_DSYNC).

But I'll readily agree your patch is a trivial improvement that
doesn't change anything, if you want to get it in before a bigger
decision is made. :-)

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html