Hi,

On 2019-12-13 01:32:10 +0000, Chris Mason wrote:
> Grepping through the code shows a wonderful assortment of helpers to
> control the cache, and RWF_UNCACHED would be both cleaner and faster
> than what we have today. I'm on the fence about asking for
> RWF_FILE_RANGE_WRITE (+/- naming) to force writes to start without
> pitching pages, but we can talk to some service owners to see how
> useful that would be. They can always chain a sync_file_range() in
> io_uring, but RWF_ would be lower overhead if it were a common
> pattern.

FWIW, for postgres something that'd allow us to do writes that

a) don't remove pages from the pagecache if they're already there,

b) don't delay writeback to some unpredictable point later.

   Writeback happening later causes both latency issues and often
   under-utilizes write bandwidth for a while. In most cases where we
   write, we know that we're not likely to write the same page again
   soon.

c) don't (except perhaps temporarily) bring pages into the pagecache
   if they weren't there before.

   In the cases where the page previously wasn't in the page cache and
   we wrote it out, it will very likely have been resident in our own
   cache for long enough that the kernel caching it for the future
   isn't useful.

would be really helpful. Right now we simulate that to some degree by
doing normal buffered writes followed by sync_file_range(WRITE) (a
rough sketch is at the end of this mail).

For most environments, always using O_DIRECT isn't really an option for
us: we can't rely on settings being tuned well enough (i.e. a large
enough application cache being configured), and we want to continue
supporting setups where a large enough postgres buffer cache isn't an
option because it'd prevent putting a number of variably used database
servers on one piece of hardware. (There are also postgres-side issues
preventing us from doing O_DIRECT performantly, partially because so
far we couldn't rely on AIO while also using buffered IO - but we're
fixing that now.)

For us, a per-request interface where we'd have to fulfill all the
requirements of O_DIRECT, but where neither reads nor writes would
cause a page to move in/out of the pagecache, would be optimal for a
good part of our IO - especially if we could still get zero-copy IO
for the pretty common case that there's no pagecache presence for a
file at all.

That'd allow us to use zero-copy writes for the common case of a
file's data fitting entirely in our cache, with us only occasionally
writing the data out at checkpoints. And to do zero-copy reads for the
cases where we know it's unnecessary for the kernel to cache (e.g.
because we are scanning a few TB of data on a machine with less
memory, because we're re-filling our cache after a restart, or because
it's a maintenance operation doing the reading). But we'd still rely
on the kernel page cache for other reads, where the kernel caching
whenever memory is available is a huge benefit. Some well-tuned
workloads would turn that off and use only O_DIRECT, but everyone else
would benefit from that being the default.

We can concoct an approximation of that behaviour with a mix of
sync_file_range() (to force writeback), RWF_NOWAIT (to see if we
should read with O_DIRECT) and mmap()/mincore()/munmap() (to determine
whether writes should use O_DIRECT) - the latter two are also sketched
below. But that's quite a bit of overhead.

The reason that specifying this on a per-request basis would be useful
is mainly that it would allow us to avoid having to either keep two
sets of FDs, or turn O_DIRECT on/off with fcntl.

Greetings,

Andres Freund
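
PS, for concreteness, here is roughly what the buffered-write-plus-
sync_file_range(WRITE) pattern looks like. This is an illustrative
sketch only, not the actual postgres code - the helper name is made up
and error handling is minimal:

/* buffered write, then immediately ask the kernel to start writeback
 * for that range, without waiting for it to complete (approximates b)
 * above); sync_file_range() needs _GNU_SOURCE */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static ssize_t
write_and_start_writeback(int fd, const void *buf, size_t len, off_t off)
{
	ssize_t ret = pwrite(fd, buf, len, off);

	if (ret > 0)
		(void) sync_file_range(fd, off, ret, SYNC_FILE_RANGE_WRITE);
	return ret;
}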
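
The RWF_NOWAIT probe for reads, again just a sketch: try a
non-blocking buffered read first, and only fall back to a second
O_DIRECT fd when the data isn't in the page cache. Assumes a kernel
>= 4.14 for RWF_NOWAIT, O_DIRECT-compatible alignment of buf/len/off,
and ignores short reads from partially cached ranges:

#define _GNU_SOURCE
#include <sys/uio.h>
#include <errno.h>

/* fd is a normal buffered fd, dfd the same file opened with O_DIRECT -
 * exactly the two-fd duplication complained about above */
static ssize_t
read_cached_else_direct(int fd, int dfd, void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t ret = preadv2(fd, &iov, 1, off, RWF_NOWAIT);

	if (ret >= 0 || errno != EAGAIN)
		return ret;		/* cached data, or a real error */
	/* not cached: read via O_DIRECT, without populating the cache */
	return preadv2(dfd, &iov, 1, off, 0);
}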
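
And the mmap()/mincore()/munmap() residency check used to decide
whether a write can go through O_DIRECT without evicting anything
(sketch; assumes off is page aligned, and the VLA plus missing
overflow checks are for brevity only):

#include <sys/mman.h>
#include <unistd.h>

/* returns 1 if any page of [off, off + len) is resident in the page
 * cache, 0 if none are, -1 on error */
static int
range_has_cached_pages(int fd, off_t off, size_t len)
{
	long		pagesz = sysconf(_SC_PAGESIZE);
	size_t		npages = (len + pagesz - 1) / pagesz;
	unsigned char vec[npages];
	void	   *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);
	int			found = 0;

	if (p == MAP_FAILED)
		return -1;
	if (mincore(p, len, vec) == 0)
	{
		for (size_t i = 0; i < npages; i++)
			found |= (vec[i] & 1);
	}
	else
		found = -1;
	munmap(p, len);
	return found;
}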
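
For completeness, the fcntl() alternative to keeping two sets of FDs -
toggling O_DIRECT on a single fd. Note that F_SETFL changes the open
file description, so this affects everything sharing it, which is part
of why a per-request flag would be nicer:

#define _GNU_SOURCE
#include <fcntl.h>

static void
set_direct(int fd, int on)
{
	int		flags = fcntl(fd, F_GETFL);

	fcntl(fd, F_SETFL, on ? (flags | O_DIRECT) : (flags & ~O_DIRECT));
}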