(sorry missed this reply!)

On 1/7/25 8:35 PM, Andrew Morton wrote:
> On Fri, 20 Dec 2024 08:47:38 -0700 Jens Axboe <axboe@xxxxxxxxx> wrote:
>
>> So here's a new approach to the same concept, but using the page cache
>> as synchronization. Due to excessive bike shedding on the naming, this
>> is now named RWF_DONTCACHE, and is less special in that it's just page
>> cache IO, except it prunes the ranges once IO is completed.
>>
>> Why do this, you may ask? The tldr is that device speeds are only
>> getting faster, while reclaim is not. Doing normal buffered IO can be
>> very unpredictable, and suck up a lot of resources on the reclaim side.
>> This leads people to use O_DIRECT as a work-around, which has its own
>> set of restrictions in terms of size, offset, and length of IO. It's
>> also inherently synchronous, and now you need async IO as well. While
>> the latter isn't necessarily a big problem as we have good options
>> available there, it also should not be a requirement when all you want
>> to do is read or write some data without caching.
>
> Of course, we're doing something here which userspace could itself do:
> drop the pagecache after reading it (with appropriate chunk sizing) and
> for writes, sync the written area then invalidate it. Possible
> added benefits from using separate threads for this.
>
> I suggest that diligence requires that we at least justify an in-kernel
> approach at this time, please.

Conceptually yes. But you'd end up doing extra work to do it. Some of
that is not so expensive, like system calls, and some of it more so,
like LRU manipulation. Outside of that, I do think it makes sense to
expose this as a generic thing, rather than requiring applications to
kick off writeback manually, reclaim manually, etc.

> And there's a possible middle-ground implementation where the kernel
> itself kicks off threads to do the drop-behind just before the read or
> write syscall returns, which will probably be simpler.
> Can we please describe why this also isn't acceptable?

That's more of an implementation detail. I didn't test anything like
that, though we surely could. If it's better, there's no reason why it
can't just be changed to do that. My gut tells me you want the task/CPU
that just did the page cache additions to do the pruning too; that
should be more efficient than having a kworker or similar do it.

> Also, it seems wrong for a read(RWF_DONTCACHE) to drop cache if it was
> already present. Because it was presumably present for a reason. Does
> this implementation already take care of this? To make an application
> which does read(/etc/passwd, RWF_DONTCACHE) less annoying?

The implementation doesn't drop pages that were already present, only
pages that got created/added to the page cache for the operation. So
that part should already work as you expect.

> Also, consuming a new page flag isn't a minor thing. It would be nice
> to see some justification around this, and some description of how many
> we have left.

For sure, though various discussions on this have already occurred, and
Kirill has already posted patches for unifying some of this. It's not
something I wanted to tackle, as I think that should be left to people
more familiar with the page/folio flags and their (sometimes odd)
interactions.

-- 
Jens Axboe
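
For reference, the userspace alternative discussed above (read or write
in chunks, then drop the range from the page cache) could look roughly
like the sketch below. It is an illustration only, not code from the
patch series. The helper names and the 1 MiB chunk size are made up for
the example, and fdatasync() stands in for "sync the written area"
(sync_file_range() could narrow that to the chunk). Unlike the in-kernel
behaviour described above, POSIX_FADV_DONTNEED also drops pages that
were already cached before the read.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Drop-behind window; 1 MiB is an arbitrary choice for this sketch. */
#define CHUNK_SIZE (1u << 20)

/*
 * Hypothetical helper: read the whole file through the page cache and
 * ask the kernel to drop each chunk once it has been consumed.
 * Returns the number of bytes read, or -1 on error.
 */
static int64_t read_drop_behind(int fd)
{
	static char buf[CHUNK_SIZE];
	off_t off = 0;

	for (;;) {
		ssize_t n = pread(fd, buf, sizeof(buf), off);

		if (n < 0)
			return -1;
		if (n == 0)
			return (int64_t)off;
		/* Advisory only; also drops pages cached before this read. */
		posix_fadvise(fd, off, n, POSIX_FADV_DONTNEED);
		off += n;
	}
}

/*
 * Hypothetical helper: buffered write in chunks starting at offset 0,
 * flushing and then invalidating each chunk.
 * Returns the number of bytes written, or -1 on error.
 */
static int64_t write_drop_behind(int fd, const char *data, size_t len)
{
	size_t done = 0;

	while (done < len) {
		size_t want = len - done;
		ssize_t n;

		if (want > CHUNK_SIZE)
			want = CHUNK_SIZE;
		n = pwrite(fd, data + done, want, (off_t)done);
		if (n < 0)
			return -1;
		/* Dirty pages cannot be dropped, so write them back first. */
		if (fdatasync(fd) < 0)
			return -1;
		posix_fadvise(fd, (off_t)done, n, POSIX_FADV_DONTNEED);
		done += (size_t)n;
	}
	return (int64_t)done;
}
```

Each chunk here costs one or two extra system calls on top of the IO
itself, plus the writeback and invalidation work behind them, which is
the extra work mentioned in the reply above.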