Hi,

On 2019-12-13 01:32:10 +0000, Chris Mason wrote:
> Grepping through the code shows a wonderful assortment of helpers to
> control the cache, and RWF_UNCACHED would be both cleaner and faster
> than what we have today. I'm on the fence about asking for
> RWF_FILE_RANGE_WRITE (+/- naming) to force writes to start without
> pitching pages, but we can talk to some service owners to see how
> useful that would be. They can always chain a sync_file_range() in
> io_uring, but RWF_ would be lower overhead if it were a common
> pattern.

FWIW, for postgres something that'd allow us to do writes that

a) don't remove pages from the pagecache if they're already there,

b) don't delay writeback to some unpredictable point later.

   Writeback happening later causes both latency issues and often
   under-utilizes write bandwidth for a while. In most cases where we
   write, we know that we're not likely to write the same page again
   soon.

c) don't (except perhaps temporarily) bring pages into the pagecache
   if they weren't there before.

   In the cases where the page previously wasn't in the page cache and
   we wrote it out, it will very likely have been resident in our own
   cache for long enough that the kernel caching it for the future
   isn't useful.

would be really helpful. Right now we simulate that to some degree by
doing normal buffered writes followed by sync_file_range(WRITE) (a
rough sketch is at the end of this mail).

For most environments, always using O_DIRECT isn't really an option for
us: we can't rely on settings being tuned well enough (i.e. a large
enough application cache being configured), and we want to continue
supporting setups where a large enough postgres buffer cache isn't an
option because it'd prevent putting a number of variably used database
servers on one piece of hardware. (There are also postgres-side issues
preventing us from doing O_DIRECT performantly, partially because so
far we couldn't rely on AIO while also using buffered IO - but we're
fixing that now.)

For us, a per-request interface where we'd have to fulfill all the
requirements of O_DIRECT, but where neither reads nor writes would
cause a page to move in/out of the pagecache, would be optimal for a
good part of our IO - especially if we could still get zero-copy IO
for the pretty common case that there's no pagecache presence for a
file at all.

That'd allow us to use zero-copy writes for the common case of a
file's data fitting entirely in our cache, with us only occasionally
writing the data out at checkpoints. And to do zero-copy reads for the
cases where we know it's unnecessary for the kernel to cache (e.g.
because we are scanning a few TB of data on a machine with less
memory, because we're re-filling our cache after a restart, or because
it's a maintenance operation doing the reading). But we'd still rely
on the kernel page cache for other reads, where the kernel caching
whenever memory is available is a huge benefit. Some well-tuned
workloads would turn that off and use only O_DIRECT, but everyone else
would benefit from that being the default.

We can concoct an approximation of that behaviour with a mix of
sync_file_range() (to force writeback), RWF_NOWAIT (to see if we
should read with O_DIRECT) and mmap()/mincore()/munmap() (to determine
whether writes should use O_DIRECT) - the latter two are also sketched
below. But that's quite a bit of overhead.

The reason that specifying this on a per-request basis would be useful
is mainly that it would allow us to avoid having to either keep two
sets of FDs, or turn O_DIRECT on/off with fcntl.

Greetings,

Andres Freund
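
PS, for concreteness, here is roughly what the buffered-write-plus-
sync_file_range(WRITE) pattern looks like. This is an illustrative
sketch only, not the actual postgres code - the helper name is made up
and error handling is minimal:

/* buffered write, then immediately ask the kernel to start writeback
 * for that range, without waiting for it to complete (approximates b)
 * above); sync_file_range() needs _GNU_SOURCE */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static ssize_t
write_and_start_writeback(int fd, const void *buf, size_t len, off_t off)
{
	ssize_t ret = pwrite(fd, buf, len, off);

	if (ret > 0)
		(void) sync_file_range(fd, off, ret, SYNC_FILE_RANGE_WRITE);
	return ret;
}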
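
The RWF_NOWAIT probe for reads, again just a sketch: try a
non-blocking buffered read first, and only fall back to a second
O_DIRECT fd when the data isn't in the page cache. Assumes a kernel
>= 4.14 for RWF_NOWAIT, O_DIRECT-compatible alignment of buf/len/off,
and ignores short reads from partially cached ranges:

#define _GNU_SOURCE
#include <sys/uio.h>
#include <errno.h>

/* fd is a normal buffered fd, dfd the same file opened with O_DIRECT -
 * exactly the two-fd duplication complained about above */
static ssize_t
read_cached_else_direct(int fd, int dfd, void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t ret = preadv2(fd, &iov, 1, off, RWF_NOWAIT);

	if (ret >= 0 || errno != EAGAIN)
		return ret;		/* cached data, or a real error */
	/* not cached: read via O_DIRECT, without populating the cache */
	return preadv2(dfd, &iov, 1, off, 0);
}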
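
And the mmap()/mincore()/munmap() residency check used to decide
whether a write can go through O_DIRECT without evicting anything
(sketch; assumes off is page aligned, and the VLA plus missing
overflow checks are for brevity only):

#include <sys/mman.h>
#include <unistd.h>

/* returns 1 if any page of [off, off + len) is resident in the page
 * cache, 0 if none are, -1 on error */
static int
range_has_cached_pages(int fd, off_t off, size_t len)
{
	long		pagesz = sysconf(_SC_PAGESIZE);
	size_t		npages = (len + pagesz - 1) / pagesz;
	unsigned char vec[npages];
	void	   *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);
	int			found = 0;

	if (p == MAP_FAILED)
		return -1;
	if (mincore(p, len, vec) == 0)
	{
		for (size_t i = 0; i < npages; i++)
			found |= (vec[i] & 1);
	}
	else
		found = -1;
	munmap(p, len);
	return found;
}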
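
For completeness, the fcntl() alternative to keeping two sets of FDs -
toggling O_DIRECT on a single fd. Note that F_SETFL changes the open
file description, so this affects everything sharing it, which is part
of why a per-request flag would be nicer:

#define _GNU_SOURCE
#include <fcntl.h>

static void
set_direct(int fd, int on)
{
	int		flags = fcntl(fd, F_GETFL);

	fcntl(fd, F_SETFL, on ? (flags | O_DIRECT) : (flags & ~O_DIRECT));
}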