On 2/8/21 4:28 PM, Dave Chinner wrote:
> On Mon, Feb 08, 2021 at 03:18:26PM -0700, Jens Axboe wrote:
>> Hi,
>>
>> Ran into an issue with IOCB_NOWAIT and O_DIRECT, which causes a rather
>> serious performance issue. If IOCB_NOWAIT is set, the generic/iomap
>> iterators check for page cache presence in the given range, and return
>> -EAGAIN if any is there. This is rather simplistic and looks like
>> something that was never really finished. For !IOCB_NOWAIT, we simply
>> call filemap_write_and_wait_range() to issue writeback (if any is
>> needed) and wait on the range. The fact that we have page cache entries
>> for this range does not mean that we cannot safely do O_DIRECT IO
>> to/from it.
>>
>> This series adds filemap_range_needs_writeback(), which checks if
>> we have pages in the range that do require us to call
>> filemap_write_and_wait_range(). If we don't, then we can proceed just
>> fine with IOCB_NOWAIT.
>
> Not exactly. If it is a write we are doing, we _must_ invalidate
> the page cache pages over the range of the DIO write to maintain
> some level of cache coherency between the DIO write and the page
> cache contents. i.e. the DIO write makes the page cache contents
> stale, so the page cache has to be invalidated before the DIO write
> is started, and again when it completes to toss away racing updates
> (mmap) while the DIO write was in flight...
>
> Page invalidation can block (page locks, waits on writeback, taking
> the mmap_sem to zap page tables, etc), and it can also fail because
> pages are dirty (e.g. writeback+invalidation racing with mmap).
>
> And if it fails because of dirty pages, then we fall back to buffered
> IO, which serialises readers and writers and will block.

Right, not disagreeing with any of that.

>> The problem manifested itself in a production environment, where someone
>> is doing O_DIRECT on a raw block device. Due to other circumstances,
>> blkid was triggered on this device periodically, and blkid very helpfully
>> does a number of page cache reads on the device. Now the mapping has
>> page cache entries, and performance falls to pieces because we can no
>> longer reliably do IOCB_NOWAIT O_DIRECT.
>
> If it was a DIO write, then the pages would have been invalidated
> on the first write, and the second write would have been issued with
> NOWAIT just fine.
>
> So the problem sounds to me like DIO reads from the block device are
> not invalidating the page cache over the read range, so they persist
> and prevent IOCB_NOWAIT IO from being submitted.

That is exactly the case I ran into, indeed.

> Historically speaking, this is why XFS always used to invalidate the
> page cache for DIO - it didn't want to leave cached clean pages that
> would prevent future DIOs from being issued concurrently because
> coherency with the page cache caused performance issues. We
> optimised away this invalidation because the data in the page cache
> is still valid after a flush+DIO read, but it sounds to me like
> there are still corner cases where "always invalidate cached pages"
> is the right thing for DIO to be doing....
>
> Not sure what the best way to go here is - the patch isn't correct
> for NOWAIT DIO writes, but it looks necessary for reads. And I'm not
> sure that we want to go back to "invalidate everything all the time"
> either....

We still do the invalidation for writes with the patch, nothing has
changed there. We just skip the filemap_write_and_wait_range() if
there's nothing to write.
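Roughly, the NOWAIT branch goes from "any cached page in the range means
-EAGAIN" to "only -EAGAIN if the range actually needs writeback". A
minimal sketch of that shape (illustrative only, 'pos'/'end'/'ret' are
placeholder names rather than the exact hunk from the series):

	if (iocb->ki_flags & IOCB_NOWAIT) {
		/*
		 * Clean, unlocked pages can simply be invalidated (or
		 * left alone for reads); only dirty/locked/writeback
		 * pages would force us into a blocking flush+wait.
		 */
		if (filemap_range_needs_writeback(mapping, pos, end))
			return -EAGAIN;
	} else {
		ret = filemap_write_and_wait_range(mapping, pos, end);
		if (ret < 0)
			return ret;
	}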
And if there's nothing to write, _hopefully_ the invalidation should go
smoothly unless someone dirtied/locked/put-under-writeback the page since
we did the check. But that's always going to be racy, and there's not a
whole lot we can do about that.

-- 
Jens Axboe
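For reference, a minimal sketch of the kind of range check being
discussed, assuming the current XArray page-cache interfaces; the actual
helper in the series may differ in structure (for instance, it can
short-circuit on the mapping's dirty/writeback tags before walking any
pages). It also shows why the answer is only advisory: nothing stops a
page from being dirtied or locked the moment the RCU walk ends.

	/* Sketch only; name and details are illustrative. */
	static bool range_needs_writeback(struct address_space *mapping,
					  loff_t start_byte, loff_t end_byte)
	{
		XA_STATE(xas, &mapping->i_pages, start_byte >> PAGE_SHIFT);
		pgoff_t max = end_byte >> PAGE_SHIFT;
		struct page *page;
		bool ret = false;

		if (end_byte < start_byte)
			return false;

		rcu_read_lock();
		xas_for_each(&xas, page, max) {
			if (xas_retry(&xas, page))
				continue;
			if (xa_is_value(page))	/* shadow/DAX entry, not a page */
				continue;
			/*
			 * Dirty, locked, or under-writeback pages mean the
			 * caller cannot skip filemap_write_and_wait_range().
			 */
			if (PageDirty(page) || PageLocked(page) ||
			    PageWriteback(page)) {
				ret = true;
				break;
			}
		}
		rcu_read_unlock();
		return ret;
	}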