Christoph Hellwig wrote: > On Wed, Sep 23, 2009 at 03:04:30PM +0100, Jamie Lokier wrote: > > Christoph Hellwig wrote: > > > In the case of direct I/O falling back to buffered I/O we sync data > > > twice currently: once at the end of generic_file_buffered_write using > > > filemap_write_and_wait_range and once a little later in > > > __generic_file_aio_write using do_sync_mapping_range with all flags set. > > > > > > The wait before write of the do_sync_mapping_range call does not make > > > any sense, so just keep the filemap_write_and_wait_range call and move > > > it to the right spot. > > > > Are you sure this is an expectation of O_DIRECT? > > > > A few notes from the net, including some documentation from IBM, > > advise using O_DIRECT|O_DSYNC if you need sync when direct I/O falls > > back to buffered on some other OSes. > > Not sure if we're on the same page here. Before the patch we do the > following in case of a fallback buffered write when opened with > O_DIRECT: > > - a filemap_write_and_wait_range in generic_file_buffered_write for > (with a too long length, btw) > - a do_sync_mapping_range(wait,write,wait) from > __generic_file_aio_write > > which expands to > > (1) __filemap_fdatawrite_range(SYNC_ALL) > (2) wait_on_page_writeback_range > (3) wait_on_page_writeback_range > (4) __filemap_fdatawrite_range(SYNC_ALL) > (5) wait_on_page_writeback_range > > which clearly doesn't make any sense. The patch reduces it to > > (1) __filemap_fdatawrite_range(SYNC_ALL) > (2) wait_on_page_writeback_range > > which does exactly the same, that is assuring the _data_ is on disk. > That's still not O_DSNC or O_SYNC semantics, but that's an entirely > different discussion. I agree that the patch is a great improvement, and it's obviously good regardless of further discussions. What I'm suggesting is that there is no need to commit the data to the disk, and sometimes it's an unwanted pessimisation. So those calls may be removed entirely. I'll describe the O_DIRECT, non-O_DSYNC case only. (O_DIRECT|O_DSYNC should of course do fdatasync_range properly after every write). Apps which need the data to reach the disk with any assurance that they'll be able to read it after a crash _must_ use fdatasync (or O_DSYNC) to get that assurance. Your explanation makes that even clearer: Historically what Linux has done is not O_DSYNC, and this is not intended to change. Because the fallback happens when extending a file or filling a hole, it's even more likely that the metadata needed to retrieve the data won't be committed when the write returns. Regarding performance, when the fallback happens, since good apps should be using fdatasync (or O_DSYNC) already, and bad apps already don't have any _useful_ guarantee, it's preferable that the buffered writes have normal buffering behaviour and can be sensibly reordered and batched like normal writes. This is particularly relevant to the common case of extending a file, and when writing to a filesystem which doesn't support O_DIRECT at all. In a nutshell, I'm saying the only useful behaviours for direct I/O fallback are "like normal writes" or "like O_DSYNC" - that "sort of kind of in between" doesn't help anything. It's a historical, unnecessary relic which should go to the same recycling bin as O_SYNC meaning O_DSYNC on Linux (and probably the change should go alongside your sensible fixes to O_SYNC/O_DSYNC). But I'll readily agree your patch is a trivial improvement that doesn't change anything, if you want to get it in before a bigger decision is made. :-) -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html