On 2/8/21 5:14 PM, Dave Chinner wrote:
> On Mon, Feb 08, 2021 at 04:37:26PM -0700, Jens Axboe wrote:
>> On 2/8/21 4:28 PM, Dave Chinner wrote:
>>> On Mon, Feb 08, 2021 at 03:18:26PM -0700, Jens Axboe wrote:
>>>> Hi,
>>>>
>>>> Ran into an issue with IOCB_NOWAIT and O_DIRECT, which causes a rather
>>>> serious performance issue. If IOCB_NOWAIT is set, the generic/iomap
>>>> iterators check for page cache presence in the given range, and return
>>>> -EAGAIN if any is there. This is rather simplistic and looks like
>>>> something that was never really finished. For !IOCB_NOWAIT, we simply
>>>> call filemap_write_and_wait_range() to issue writeback (if any) and
>>>> wait on the range. The fact that we have page cache entries for this
>>>> range does not mean that we cannot safely do O_DIRECT IO to/from it.
>>>>
>>>> This series adds filemap_range_needs_writeback(), which checks if
>>>> we have pages in the range that do require us to call
>>>> filemap_write_and_wait_range(). If we don't, then we can proceed just
>>>> fine with IOCB_NOWAIT.
>>>
>>> Not exactly. If it is a write we are doing, we _must_ invalidate
>>> the page cache pages over the range of the DIO write to maintain
>>> some level of cache coherency between the DIO write and the page
>>> cache contents. i.e. the DIO write makes the page cache contents
>>> stale, so the page cache has to be invalidated before the DIO write
>>> is started, and again when it completes to toss away racing updates
>>> (mmap) while the DIO write was in flight...
>>>
>>> Page invalidation can block (page locks, waits on writeback, taking
>>> the mmap_sem to zap page tables, etc), and it can also fail because
>>> pages are dirty (e.g. writeback+invalidation racing with mmap).
>>>
>>> And if it fails because of dirty pages then we fall back to buffered
>>> IO, which serialises readers and writes and will block.
>>
>> Right, not disagreeing with any of that.
>>
>>>> The problem manifested itself in a production environment, where someone
>>>> is doing O_DIRECT on a raw block device. Due to other circumstances,
>>>> blkid was triggered on this device periodically, and blkid very helpfully
>>>> does a number of page cache reads on the device. Now the mapping has
>>>> page cache entries, and performance falls to pieces because we can no
>>>> longer reliably do IOCB_NOWAIT O_DIRECT.
>>>
>>> If it was a DIO write, then the pages would have been invalidated
>>> on the first write and the second write would be issued with NOWAIT
>>> just fine.
>>>
>>> So the problem sounds to me like DIO reads from the block device are
>>> not invalidating the page cache over the read range, so they persist
>>> and prevent IOCB_NOWAIT IO from being submitted.
>>
>> That is exactly the case I ran into indeed.
>>
>>> Historically speaking, this is why XFS always used to invalidate the
>>> page cache for DIO - it didn't want to leave cached clean pages that
>>> would prevent future DIOs from being issued concurrently because
>>> coherency with the page cache caused performance issues. We
>>> optimised away this invalidation because the data in the page cache
>>> is still valid after a flush+DIO read, but it sounds to me like
>>> there are still corner cases where "always invalidate cached pages"
>>> is the right thing for DIO to be doing....
>>>
>>> Not sure what the best way to go here is - the patch isn't correct
>>> for NOWAIT DIO writes, but it looks necessary for reads. And I'm not
>>> sure that we want to go back to "invalidate everything all the time"
>>> either....
>>
>> We still do the invalidation for writes with the patch, nothing has
>> changed there. We just skip the filemap_write_and_wait_range() if
>> there's nothing to write. And if there's nothing to write, _hopefully_
>> the invalidation should go smoothly unless someone
>> dirtied/locked/put-under-writeback the page since we did the check.
>> But that's always going to be racy, and there's not a whole lot we can
>> do about that.
>
> Sure, but if someone has actually mapped the range and is accessing
> it, then PTEs will need zapping and mmap_sem needs to be taken in
> write mode. If there's continual racing access, you've now got the
> mmap_sem regularly being taken exclusively in the IOCB_NOWAIT path
> and that means it will get serialised against other threads in the
> task doing page faults and other mm context operations. The "needs
> writeback" check you've added does nothing to alleviate this
> potential blocking point for the write path.
>
> That's my point - you're exposing obvious new blocking points for
> IOCB_NOWAIT DIO writes, not removing them. It may not happen very
> often, but the whack-a-mole game you are playing with IOCB_NOWAIT is
> "we found an even rarer blocking condition that is critical to
> our application". While this patch whacks this specific mole in the
> read path, it also exposes the write path to another rare blocking
> condition that will eventually end up being the mole that needs to
> be whacked...
>
> Perhaps the needs-writeback optimisation should only be applied to
> the DIO read path?

Sure, we can do that as a first version, and then tackle the remainder
of the write side as a separate thing, since we need to handle the
inode page invalidation separately too. I'll send out a v2 with just
the read side.

--
Jens Axboe
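For reference, here is a minimal sketch of the read-side check being
discussed, assuming the filemap_range_needs_writeback() helper the series
describes; the function dio_read_nowait_sketch() and the surrounding read
path are illustrative only, not the actual patch:

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/uio.h>

/*
 * Illustrative only: a NOWAIT O_DIRECT read path that returns -EAGAIN
 * only when cached pages in the range are dirty or under writeback,
 * instead of bailing on the presence of any cached page at all.
 */
static ssize_t dio_read_nowait_sketch(struct kiocb *iocb, struct iov_iter *iter)
{
	struct address_space *mapping = iocb->ki_filp->f_mapping;
	loff_t pos = iocb->ki_pos;
	loff_t end = pos + iov_iter_count(iter) - 1;

	if (iocb->ki_flags & IOCB_NOWAIT) {
		/*
		 * Clean cached pages are fine for a DIO read; only a page
		 * that would force filemap_write_and_wait_range() to block
		 * is a reason to punt with -EAGAIN.
		 */
		if (filemap_range_needs_writeback(mapping, pos, end))
			return -EAGAIN;
	} else {
		int ret = filemap_write_and_wait_range(mapping, pos, end);

		if (ret)
			return ret;
	}

	/* ... issue the actual direct I/O here ... */
	return 0;
}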