On 2/8/21 5:14 PM, Dave Chinner wrote:
> On Mon, Feb 08, 2021 at 04:37:26PM -0700, Jens Axboe wrote:
>> On 2/8/21 4:28 PM, Dave Chinner wrote:
>>> On Mon, Feb 08, 2021 at 03:18:26PM -0700, Jens Axboe wrote:
>>>> Hi,
>>>>
>>>> Ran into an issue with IOCB_NOWAIT and O_DIRECT, which causes a rather
>>>> serious performance issue. If IOCB_NOWAIT is set, the generic/iomap
>>>> iterators check for page cache presence in the given range, and return
>>>> -EAGAIN if any is there. This is rather simplistic and looks like
>>>> something that was never really finished. For !IOCB_NOWAIT, we simply
>>>> call filemap_write_and_wait_range() to issue writeback (if any) and
>>>> wait on the range. The fact that we have page cache entries for this
>>>> range does not mean that we cannot safely do O_DIRECT IO to/from it.
>>>>
>>>> This series adds filemap_range_needs_writeback(), which checks if
>>>> we have pages in the range that do require us to call
>>>> filemap_write_and_wait_range(). If we don't, then we can proceed just
>>>> fine with IOCB_NOWAIT.
>>>
>>> Not exactly. If it is a write we are doing, we _must_ invalidate
>>> the page cache pages over the range of the DIO write to maintain
>>> some level of cache coherency between the DIO write and the page
>>> cache contents. i.e. the DIO write makes the page cache contents
>>> stale, so the page cache has to be invalidated before the DIO write
>>> is started, and again when it completes to toss away racing updates
>>> (mmap) while the DIO write was in flight...
>>>
>>> Page invalidation can block (page locks, waits on writeback, taking
>>> the mmap_sem to zap page tables, etc), and it can also fail because
>>> pages are dirty (e.g. writeback+invalidation racing with mmap).
>>>
>>> And if it fails because of dirty pages then we fall back to buffered
>>> IO, which serialises readers and writes and will block.
>>
>> Right, not disagreeing with any of that.
>>
>>>> The problem manifested itself in a production environment, where someone
>>>> is doing O_DIRECT on a raw block device. Due to other circumstances,
>>>> blkid was triggered on this device periodically, and blkid very helpfully
>>>> does a number of page cache reads on the device. Now the mapping has
>>>> page cache entries, and performance falls to pieces because we can no
>>>> longer reliably do IOCB_NOWAIT O_DIRECT.
>>>
>>> If it was a DIO write, then the pages would have been invalidated
>>> on the first write and the second write would be issued with NOWAIT
>>> just fine.
>>>
>>> So the problem sounds to me like DIO reads from the block device are
>>> not invalidating the page cache over the read range, so they persist
>>> and prevent IOCB_NOWAIT IO from being submitted.
>>
>> That is exactly the case I ran into indeed.
>>
>>> Historically speaking, this is why XFS always used to invalidate the
>>> page cache for DIO - it didn't want to leave cached clean pages that
>>> would prevent future DIOs from being issued concurrently because
>>> coherency with the page cache caused performance issues. We
>>> optimised away this invalidation because the data in the page cache
>>> is still valid after a flush+DIO read, but it sounds to me like
>>> there are still corner cases where "always invalidate cached pages"
>>> is the right thing for DIO to be doing....
>>>
>>> Not sure what the best way to go here is - the patch isn't correct
>>> for NOWAIT DIO writes, but it looks necessary for reads. And I'm not
>>> sure that we want to go back to "invalidate everything all the time"
>>> either....
>>
>> We still do the invalidation for writes with the patch, nothing has
>> changed there. We just skip the filemap_write_and_wait_range() if
>> there's nothing to write. And if there's nothing to write, _hopefully_
>> the invalidation should go smoothly unless someone
>> dirtied/locked/put-under-writeback the page since we did the check.
>> But that's always going to be racy, and there's not a whole lot we can
>> do about that.
>
> Sure, but if someone has actually mapped the range and is accessing
> it, then PTEs will need zapping and mmap_sem needs to be taken in
> write mode. If there's continual racing access, you've now got the
> mmap_sem regularly being taken exclusively in the IOCB_NOWAIT path
> and that means it will get serialised against other threads in the
> task doing page faults and other mm context operations. The "needs
> writeback" check you've added does nothing to alleviate this
> potential blocking point for the write path.
>
> That's my point - you're exposing obvious new blocking points for
> IOCB_NOWAIT DIO writes, not removing them. It may not happen very
> often, but the whack-a-mole game you are playing with IOCB_NOWAIT is
> "we found an even rarer blocking condition that is critical to
> our application". While this patch whacks this specific mole in the
> read path, it also exposes the write path to another rare blocking
> condition that will eventually end up being the mole that needs to
> be whacked...
>
> Perhaps the needs-writeback optimisation should only be applied to
> the DIO read path?

Sure, we can do that as a first version, and then tackle the remainder
of the write side as a separate thing, since we need to handle the
inode page invalidation separately too. I'll send out a v2 with just
the read side.

--
Jens Axboe
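For reference, here is a minimal sketch of the read-side check being
discussed, assuming the filemap_range_needs_writeback() helper the series
describes; the function dio_read_nowait_sketch() and the surrounding read
path are illustrative only, not the actual patch:

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/uio.h>

/*
 * Illustrative only: a NOWAIT O_DIRECT read path that returns -EAGAIN
 * only when cached pages in the range are dirty or under writeback,
 * instead of bailing on the presence of any cached page at all.
 */
static ssize_t dio_read_nowait_sketch(struct kiocb *iocb, struct iov_iter *iter)
{
	struct address_space *mapping = iocb->ki_filp->f_mapping;
	loff_t pos = iocb->ki_pos;
	loff_t end = pos + iov_iter_count(iter) - 1;

	if (iocb->ki_flags & IOCB_NOWAIT) {
		/*
		 * Clean cached pages are fine for a DIO read; only a page
		 * that would force filemap_write_and_wait_range() to block
		 * is a reason to punt with -EAGAIN.
		 */
		if (filemap_range_needs_writeback(mapping, pos, end))
			return -EAGAIN;
	} else {
		int ret = filemap_write_and_wait_range(mapping, pos, end);

		if (ret)
			return ret;
	}

	/* ... issue the actual direct I/O here ... */
	return 0;
}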