On Fri, 20 Dec 2024 08:47:38 -0700 Jens Axboe <axboe@xxxxxxxxx> wrote:

> So here's a new approach to the same concept, but using the page cache
> as synchronization. Due to excessive bike shedding on the naming, this
> is now named RWF_DONTCACHE, and is less special in that it's just page
> cache IO, except it prunes the ranges once IO is completed.
>
> Why do this, you may ask? The tldr is that device speeds are only
> getting faster, while reclaim is not. Doing normal buffered IO can be
> very unpredictable, and suck up a lot of resources on the reclaim side.
> This leads people to use O_DIRECT as a work-around, which has its own
> set of restrictions in terms of size, offset, and length of IO. It's
> also inherently synchronous, and now you need async IO as well. While
> the latter isn't necessarily a big problem as we have good options
> available there, it also should not be a requirement when all you want
> to do is read or write some data without caching.

Of course, we're doing something here which userspace could itself do:
drop the pagecache after reading it (with appropriate chunk sizing) and
for writes, sync the written area then invalidate it. Possible added
benefits from using separate threads for this.

I suggest that diligence requires that we at least justify an in-kernel
approach at this time, please.

And there's a possible middle-ground implementation where the kernel
itself kicks off threads to do the drop-behind just before the read or
write syscall returns, which will probably be simpler. Can we please
describe why this also isn't acceptable?

Also, it seems wrong for a read(RWF_DONTCACHE) to drop cache if it was
already present. Because it was presumably present for a reason. Does
this implementation already take care of this? To make an application
which does read(/etc/passwd, RWF_DONTCACHE) less annoying?

Also, consuming a new page flag isn't a minor thing.
It would be nice to see some justification around this, and some
description of how many we have left.
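
For concreteness, the userspace alternative mentioned above might look
roughly like the sketch below. This is only an illustration, not code
from the patchset: the helper names are hypothetical, and it uses
fdatasync() to flush the whole file where a range-limited writeback
(for example sync_file_range(2) on Linux) would match "sync the written
area" more closely. POSIX_FADV_DONTNEED is advisory, so pages may stay
cached.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read a chunk, then ask the kernel to drop it. */
static ssize_t read_dropbehind(int fd, void *buf, size_t len, off_t off)
{
	ssize_t n = pread(fd, buf, len, off);

	if (n > 0)
		/* Advisory only; the return value is deliberately ignored. */
		(void)posix_fadvise(fd, off, n, POSIX_FADV_DONTNEED);
	return n;
}

/* Hypothetical helper: write a chunk, write it back, then drop it. */
static ssize_t write_dropbehind(int fd, const void *buf, size_t len, off_t off)
{
	ssize_t n = pwrite(fd, buf, len, off);

	if (n <= 0)
		return n;
	/* Dirty pages cannot be invalidated, so flush them first.
	 * Assumption: a whole-file fdatasync() stands in for a range sync. */
	if (fdatasync(fd) != 0)
		return -1;
	(void)posix_fadvise(fd, off, n, POSIX_FADV_DONTNEED);
	return n;
}

/* Round-trip check on a temporary file: 0 on success, -1 on failure. */
static int roundtrip_selfcheck(void)
{
	char path[] = "/tmp/dropbehind-XXXXXX";
	char out[8192], in[8192];
	int fd = mkstemp(path);
	int ok;

	if (fd < 0)
		return -1;
	memset(out, 'x', sizeof out);
	ok = write_dropbehind(fd, out, sizeof out, 0) == (ssize_t)sizeof out &&
	     read_dropbehind(fd, in, sizeof in, 0) == (ssize_t)sizeof in &&
	     memcmp(out, in, sizeof out) == 0;
	close(fd);
	unlink(path);
	return ok ? 0 : -1;
}
```

Note that this sketch drops the range whether or not it was cached
before the call, which is exactly the read(/etc/passwd) concern raised
above. It also costs extra syscalls per chunk, which is part of what an
in-kernel justification would need to weigh.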