* Dave Chinner (david@xxxxxxxxxxxxx) wrote:
> On Wed, Jul 03, 2013 at 10:53:08AM -0400, Jeff Moyer wrote:
> > Mel Gorman <mgorman@xxxxxxx> writes:
> >
> > >> > I just tried replacing my sync_file_range()+fadvise() calls and instead
> > >> > pass the O_DIRECT flag to open(). Unfortunately, I must be doing
> > >> > something very wrong, because I get only 1/3rd of the throughput, and
> > >> > the page cache fills up. Any idea why?
> > >>
> > >> Since O_DIRECT does not seem to provide acceptable throughput, it may be
> > >> interesting to investigate other ways to lessen the latency impact of
> > >> the fadvise DONTNEED hint.
> > >>
> > >
> > > There are cases where O_DIRECT falls back to buffered IO, which is why you
> > > might have found that the page cache was still filling up. There are a few
> > > reasons why this can happen, but I would guess the common cause is that
> > > the range of pages being written was in the page cache already and could
> > > not be invalidated for some reason. I'm guessing this is the common case
> > > for page cache filling even with O_DIRECT, but would not bet money on it
> > > as it's not a problem I have investigated before.
> >
> > Even when O_DIRECT falls back to buffered I/O for writes, it will
> > invalidate the page cache range described by the buffered I/O once it
> > completes. For reads, the range is written out synchronously before the
> > direct I/O is issued. Either way, you shouldn't see the page cache
> > filling up.
>
> <sigh>
>
> I keep forgetting that filesystems other than XFS have sub-optimal
> direct IO implementations. I wish that "silent fallback to buffered
> IO" idea had never seen the light of day, and that filesystems
> implemented direct IO properly.
>
> > Switching to O_DIRECT often incurs a performance hit, especially if the
> > application does not submit more than one I/O at a time. Remember,
> > you're not getting readahead, and you're not getting the benefit of the
> > writeback code submitting batches of I/O.
>
> With the way IO is being done, there won't be any readahead (write
> only workload) and they are directly controlling writeback one chunk
> at a time, so there's no writeback caching to do batching, either.
> There's no obvious reason that direct IO should be any slower,
> assuming that the application is actually doing 1MB sized and
> aligned IOs as was mentioned, because both methods are directly
> dispatching and then waiting for IO completion.

As a clarification, I use 256kB "chunks" (sub-buffers) in my tests, not
1MB. Also, please note that since I'm using splice(), each individual
splice() call is internally limited to 16 pages worth of data transfer
(64kB).

> What filesystem is in use here?

My test was performed on an ext3 filesystem, which was itself sitting on
RAID-1 software RAID.

Thanks,

Mathieu

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
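
[Editor's note: for readers following the thread, below is a minimal
sketch of the buffered-write pattern being compared against O_DIRECT
here: write a chunk, push it to disk with sync_file_range(), then drop
it from the page cache with posix_fadvise(POSIX_FADV_DONTNEED). This is
not Mathieu's actual code; the file name, chunk count, and flag
combination are illustrative assumptions only.]

/*
 * Sketch (not the actual tracer code): write 256kB chunks, force each
 * one to disk, then evict it from the page cache.  The O_DIRECT variant
 * under discussion would instead open with O_WRONLY | O_DIRECT and use
 * a posix_memalign()'d buffer with aligned sizes/offsets.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK_SIZE	(256 * 1024)	/* 256kB sub-buffer, as in the test */

static int write_chunk(int fd, const char *buf, size_t len, off_t offset)
{
	if (pwrite(fd, buf, len, offset) != (ssize_t) len)
		return -1;

	/* Start writeback of this chunk and wait for it to complete. */
	if (sync_file_range(fd, offset, len,
			    SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER) < 0)
		return -1;

	/* Data is on disk; tell the kernel we won't touch it again. */
	if (posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED) < 0)
		return -1;

	return 0;
}

int main(void)
{
	char *buf;
	off_t offset = 0;
	int i, fd;

	fd = open("trace.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	buf = malloc(CHUNK_SIZE);
	if (!buf)
		return 1;
	memset(buf, 'x', CHUNK_SIZE);

	for (i = 0; i < 64; i++) {	/* 16MB written as 256kB chunks */
		if (write_chunk(fd, buf, CHUNK_SIZE, offset) < 0) {
			perror("write_chunk");
			return 1;
		}
		offset += CHUNK_SIZE;
	}

	free(buf);
	close(fd);
	return 0;
}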
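
[Editor's note: Mathieu's point about splice() can be illustrated with a
rough sketch as well. It is not the actual consumer code; the function
name and flags are assumptions. Because a single splice() from a pipe
moves at most the pipe buffer size (16 pages, i.e. 64kB with 4kB pages),
draining one 256kB sub-buffer takes several calls, hence the loop.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Drain one sub-buffer of 'len' bytes from a pipe into a file. */
static int splice_chunk(int pipe_fd, int out_fd, loff_t *out_off, size_t len)
{
	while (len > 0) {
		ssize_t ret = splice(pipe_fd, NULL, out_fd, out_off, len,
				     SPLICE_F_MOVE | SPLICE_F_MORE);

		if (ret < 0) {
			perror("splice");
			return -1;
		}
		if (ret == 0)		/* pipe drained early */
			break;
		len -= ret;		/* at most ~64kB moved per call */
	}
	return 0;
}

A 256kB chunk therefore needs at least four iterations of this loop,
which is the per-call limit Mathieu refers to above.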