Re: [PATCHSET v2 0/3] Improve IOCB_NOWAIT O_DIRECT reads

Jens Axboe <axboe@xxxxxxxxx> · Wed, 10 Feb 2021 07:47:49 -0700

On 2/10/21 1:07 AM, Sedat Dilek wrote:
> On Tue, Feb 9, 2021 at 10:25 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>
>> On 2/9/21 12:55 PM, Andrew Morton wrote:
>>> On Mon,  8 Feb 2021 19:30:05 -0700 Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>
>>>> Hi,
>>>>
>>>> For v1, see:
>>>>
>>>> https://lore.kernel.org/linux-fsdevel/20210208221829.17247-1-axboe@xxxxxxxxx/
>>>>
>>>> tldr; don't -EAGAIN IOCB_NOWAIT dio reads just because we have page cache
>>>> entries for the given range. This causes unnecessary work from the callers
>>>> side, when the IO could have been issued totally fine without blocking on
>>>> writeback when there is none.
>>>>
>>>
>>> Seems a good idea.  Obviously we'll do more work in the case where some
>>> writeback needs doing, but we'll be doing synchronous writeout in that
>>> case anyway so who cares.
>>
>> Right, I think that'll be a round two on top of this, so we can make the
>> write side happier too. That's a bit more involved...
>>
>>> Please remind me what prevents pages from becoming dirty during or
>>> immediately after the filemap_range_needs_writeback() check?  Perhaps
>>> filemap_range_needs_writeback() could have a comment explaining what it
>>> is that keeps its return value true after it has returned it!
>>
>> It's inherently racy, just like it is now. There's really no difference
>> there, and I don't think there's a way to close that. Even if you
>> modified filemap_write_and_wait_range() to be non-block friendly,
>> there's nothing stopping anyone from adding dirty page cache right after
>> that call.
>>
> 
> Jens, do you have some numbers before and after your patchset is applied?

I don't. The load was pretty light for the test case: it was just doing
33-34K of O_DIRECT 4k random reads in a pretty small range of the device.
When you end up having page cache in that range, you end up punting a
LOT of requests to the async worker. So for this particular case it
wasn't so much a performance win as an efficiency win. You get rid of a
worker using 40% CPU, and you reduce the latencies.

> And kindly a test "profile" for FIO :-)?

To reproduce this, run O_DIRECT random reads over a small range, and
have something else do a few buffered reads from the same range.
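
A rough, untested sketch of what such a job file could look like. The
device path, range size, queue depth, runtime and buffered read rate
below are placeholders picked for illustration, not the values from my
test, and io_uring as the engine is an assumption:

```ini
# Hypothetical fio job: O_DIRECT random reads over a small range, plus a
# slow buffered reader that populates page cache in the same range.
[global]
# Placeholder device, substitute your own.
filename=/dev/nvme0n1
# Keep the range small so the two jobs overlap.
size=1g
bs=4k
runtime=30
time_based

[dio-randread]
ioengine=io_uring
direct=1
rw=randread
iodepth=32

[buffered-read]
ioengine=psync
direct=0
rw=randread
# Only a few buffered reads are needed to leave page cache behind.
rate_iops=10
```

Both jobs start together since there is no stonewall between them.
Compare the CPU usage of the io_uring worker threads and the completion
latencies of the dio job with and without the patchset applied.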

-- 
Jens Axboe