On 2/8/21 4:28 PM, Dave Chinner wrote:
> On Mon, Feb 08, 2021 at 03:18:26PM -0700, Jens Axboe wrote:
>> Hi,
>>
>> Ran into an issue with IOCB_NOWAIT and O_DIRECT, which causes a rather
>> serious performance issue. If IOCB_NOWAIT is set, the generic/iomap
>> iterators check for page cache presence in the given range, and return
>> -EAGAIN if any is there. This is rather simplistic and looks like
>> something that was never really finished. For !IOCB_NOWAIT, we simply
>> call filemap_write_and_wait_range() to issue writeback (if any is
>> needed) and wait on the range. The fact that we have page cache entries
>> for this range does not mean that we cannot safely do O_DIRECT IO
>> to/from it.
>>
>> This series adds filemap_range_needs_writeback(), which checks if
>> we have pages in the range that do require us to call
>> filemap_write_and_wait_range(). If we don't, then we can proceed just
>> fine with IOCB_NOWAIT.
>
> Not exactly. If it is a write we are doing, we _must_ invalidate
> the page cache pages over the range of the DIO write to maintain
> some level of cache coherency between the DIO write and the page
> cache contents. i.e. the DIO write makes the page cache contents
> stale, so the page cache has to be invalidated before the DIO write
> is started, and again when it completes to toss away racing updates
> (mmap) while the DIO write was in flight...
>
> Page invalidation can block (page locks, waits on writeback, taking
> the mmap_sem to zap page tables, etc), and it can also fail because
> pages are dirty (e.g. writeback+invalidation racing with mmap).
>
> And if it fails because of dirty pages, then we fall back to buffered
> IO, which serialises readers and writers and will block.

Right, not disagreeing with any of that.

>> The problem manifested itself in a production environment, where someone
>> is doing O_DIRECT on a raw block device. Due to other circumstances,
>> blkid was triggered on this device periodically, and blkid very helpfully
>> does a number of page cache reads on the device. Now the mapping has
>> page cache entries, and performance falls to pieces because we can no
>> longer reliably do IOCB_NOWAIT O_DIRECT.
>
> If it was a DIO write, then the pages would have been invalidated
> on the first write, and the second write would have been issued with
> NOWAIT just fine.
>
> So the problem sounds to me like DIO reads from the block device are
> not invalidating the page cache over the read range, so they persist
> and prevent IOCB_NOWAIT IO from being submitted.

That is exactly the case I ran into, indeed.

> Historically speaking, this is why XFS always used to invalidate the
> page cache for DIO - it didn't want to leave cached clean pages that
> would prevent future DIOs from being issued concurrently because
> coherency with the page cache caused performance issues. We
> optimised away this invalidation because the data in the page cache
> is still valid after a flush+DIO read, but it sounds to me like
> there are still corner cases where "always invalidate cached pages"
> is the right thing for DIO to be doing....
>
> Not sure what the best way to go here is - the patch isn't correct
> for NOWAIT DIO writes, but it looks necessary for reads. And I'm not
> sure that we want to go back to "invalidate everything all the time"
> either....

We still do the invalidation for writes with the patch, nothing has
changed there. We just skip the filemap_write_and_wait_range() if
there's nothing to write.
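Roughly, the NOWAIT branch goes from "any cached page in the range means
-EAGAIN" to "only -EAGAIN if the range actually needs writeback". A
minimal sketch of that shape (illustrative only, 'pos'/'end'/'ret' are
placeholder names rather than the exact hunk from the series):

	if (iocb->ki_flags & IOCB_NOWAIT) {
		/*
		 * Clean, unlocked pages can simply be invalidated (or
		 * left alone for reads); only dirty/locked/writeback
		 * pages would force us into a blocking flush+wait.
		 */
		if (filemap_range_needs_writeback(mapping, pos, end))
			return -EAGAIN;
	} else {
		ret = filemap_write_and_wait_range(mapping, pos, end);
		if (ret < 0)
			return ret;
	}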
And if there's nothing to write, _hopefully_ the invalidation should go
smoothly unless someone dirtied/locked/put-under-writeback the page since
we did the check. But that's always going to be racy, and there's not a
whole lot we can do about that.

-- 
Jens Axboe
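For reference, a minimal sketch of the kind of range check being
discussed, assuming the current XArray page-cache interfaces; the actual
helper in the series may differ in structure (for instance, it can
short-circuit on the mapping's dirty/writeback tags before walking any
pages). It also shows why the answer is only advisory: nothing stops a
page from being dirtied or locked the moment the RCU walk ends.

	/* Sketch only; name and details are illustrative. */
	static bool range_needs_writeback(struct address_space *mapping,
					  loff_t start_byte, loff_t end_byte)
	{
		XA_STATE(xas, &mapping->i_pages, start_byte >> PAGE_SHIFT);
		pgoff_t max = end_byte >> PAGE_SHIFT;
		struct page *page;
		bool ret = false;

		if (end_byte < start_byte)
			return false;

		rcu_read_lock();
		xas_for_each(&xas, page, max) {
			if (xas_retry(&xas, page))
				continue;
			if (xa_is_value(page))	/* shadow/DAX entry, not a page */
				continue;
			/*
			 * Dirty, locked, or under-writeback pages mean the
			 * caller cannot skip filemap_write_and_wait_range().
			 */
			if (PageDirty(page) || PageLocked(page) ||
			    PageWriteback(page)) {
				ret = true;
				break;
			}
		}
		rcu_read_unlock();
		return ret;
	}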