Re: [PATCHSET v8 0/12] Uncached buffered IO

Jens Axboe <axboe@xxxxxxxxx> · Mon, 13 Jan 2025 08:34:18 -0700

(sorry missed this reply!)

On 1/7/25 8:35 PM, Andrew Morton wrote:
> On Fri, 20 Dec 2024 08:47:38 -0700 Jens Axboe <axboe@xxxxxxxxx> wrote:
> 
>> So here's a new approach to the same concept, but using the page cache
>> as synchronization. Due to excessive bike shedding on the naming, this
>> is now named RWF_DONTCACHE, and is less special in that it's just page
>> cache IO, except it prunes the ranges once IO is completed.
>>
>> Why do this, you may ask? The tldr is that device speeds are only
>> getting faster, while reclaim is not. Doing normal buffered IO can be
>> very unpredictable, and suck up a lot of resources on the reclaim side.
>> This leads people to use O_DIRECT as a work-around, which has its own
>> set of restrictions in terms of size, offset, and length of IO. It's
>> also inherently synchronous, and now you need async IO as well. While
>> the latter isn't necessarily a big problem as we have good options
>> available there, it also should not be a requirement when all you want
>> to do is read or write some data without caching.
> 
> Of course, we're doing something here which userspace could itself do:
> drop the pagecache after reading it (with appropriate chunk sizing) and
> for writes, sync the written area then invalidate it.  Possible
> added benefits from using separate threads for this.
> 
> I suggest that diligence requires that we at least justify an in-kernel
> approach at this time, please.

Conceptually yes. But you'd end up doing extra work to do it. Some of
that is not so expensive, like the extra system calls, and some of it is
more so, like LRU manipulation. Beyond that, I do think it makes sense
to expose this as a generic thing, rather than requiring applications to
kick writeback manually, reclaim manually, etc.
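
For reference, the userspace variant being discussed could look roughly
like the sketch below. The helper names read_dropbehind() and
write_dropbehind() are made up for illustration, and the fdatasync()
fallback is an assumption about what a caller would want when
sync_file_range() is not usable on a given file.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: buffered read, then ask the kernel to drop the range. */
ssize_t read_dropbehind(int fd, void *buf, size_t len, off_t off)
{
	ssize_t ret = pread(fd, buf, len, off);

	/* Extra syscall per IO; the advice is best effort, so ignore its result. */
	if (ret > 0)
		posix_fadvise(fd, off, ret, POSIX_FADV_DONTNEED);
	return ret;
}

/* Hypothetical helper: buffered write, kick and wait for writeback, then drop. */
ssize_t write_dropbehind(int fd, const void *buf, size_t len, off_t off)
{
	ssize_t ret = pwrite(fd, buf, len, off);

	if (ret <= 0)
		return ret;

	/* Dirty pages cannot be dropped, so the range must be written back first. */
	if (sync_file_range(fd, off, ret,
			    SYNC_FILE_RANGE_WAIT_BEFORE |
			    SYNC_FILE_RANGE_WRITE |
			    SYNC_FILE_RANGE_WAIT_AFTER) < 0) {
		/* Assumed fallback: flush the whole file's data instead. */
		if (fdatasync(fd) < 0)
			return -1;
	}

	posix_fadvise(fd, off, ret, POSIX_FADV_DONTNEED);
	return ret;
}
```

Each IO here costs two or three system calls instead of one, and
POSIX_FADV_DONTNEED drops clean pages in the range regardless of who
brought them in, so data that was already cached gets evicted as well.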

> And there's a possible middle-ground implementation where the kernel
> itself kicks off threads to do the drop-behind just before the read or
> write syscall returns, which will probably be simpler.  Can we please
> describe why this also isn't acceptable?

That's more of an implementation detail. I didn't test anything like
that, though we surely could. If it's better, there's no reason why it
can't just be changed to do that. My gut tells me you want the task/CPU
that just did the page cache additions to do the pruning too, as that
should be more efficient than having a kworker or similar do it.

> Also, it seems wrong for a read(RWF_DONTCACHE) to drop cache if it was
> already present.  Because it was presumably present for a reason.  Does
> this implementation already take care of this?  To make an application
> which does read(/etc/passwd, RWF_DONTCACHE) less annoying?

The implementation doesn't drop pages that were already present, only
pages that got created/added to the page cache for the operation. So
that part should already work as you expect.

> Also, consuming a new page flag isn't a minor thing.  It would be nice
> to see some justification around this, and some description of how many
> we have left.

For sure, though various discussions on this already occurred and Kirill
posted patches for unifying some of this already. It's not something I
wanted to tackle, as I think that should be left to people more familiar
with the page/folio flags and their (sometimes odd) interactions.

-- 
Jens Axboe