Re: [PATCH v11 0/3] cachestat: a new syscall for page cache state of files

Johannes Weiner <hannes@xxxxxxxxxxx> · Wed, 15 Mar 2023 13:09:34 -0400

Hi Andrew,

On Tue, Mar 14, 2023 at 04:00:41PM -0700, Andrew Morton wrote:
> On Tue,  7 Mar 2023 19:27:45 -0800 Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> 
> > There is currently no good way to query the page cache state of large
> > file sets and directory trees. There is mincore(), but it scales poorly:
> > the kernel writes out a lot of bitmap data that userspace has to
> > aggregate, when the user really does not care about per-page information
> > in that case. The user also needs to mmap and unmap each file as it goes
> > along, which can be quite slow as well.
> 
> A while ago I asked about the security implications - could cachestat()
> be used to figure out what parts of a file another user is reading. 
> This also applies to mincore(), but cachestat() newly permits user A to
> work out which parts of a file user B has *written* to.

The caller of cachestat() must have the file open for reading. If they
can read the contents that B has written, is the fact that they can
see dirty state really a concern?

Nhat and I were discussing this offlist at the time, but weren't
creative enough to come up with an abuse scenario.

> I don't recall seeing a response to this, and there is no discussion in
> the changelogs.

It might have drowned in the noise, but he did reply:

https://lore.kernel.org/lkml/CAKEwX=Ppf=WbOuV2Rh3+V8ohOYXo=CnfSu9qqSh-DpVvfy2nhA@xxxxxxxxxxxxxx/

> Secondly, I'm not seeing description of any use cases.  OK, it's faster
> and better than mincore(), but who cares?  In other words, what
> end-user value compels us to add this feature to Linux?

Years ago there was a thread about adding dirty bits to mincore(), I
don't know if you remember this:

https://lkml.org/lkml/2013/2/10/162

In that thread, Rusty described a usecase of maintaining a journaling
file alongside a main file. The idea for testing the dirty state isn't
to call sync but to see whether the journal needs to be updated.

The efficiency of mincore() was touched on too. Andres Freund (CC'd,
hopefully I got the email address right) mentioned that Postgres has a
usecase for deciding whether to do an index scan or query tables
directly, based on whether the index is cached. Postgres works with
files rather than memory regions, and Andres mentioned that the index
could be quite large. The consensus was that having to go through
mmap(), and getting a bytemap representing each page when all you need
is a summary for the queried range, was too painful in practice.
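
For reference, the mmap() plus mincore() pattern being criticized looks
roughly like the sketch below. It is not taken from Postgres or from that
thread; count_resident_pages() is a made-up helper name, and error handling
is kept minimal. Note that the caller has to map the whole file and walk one
byte per page just to obtain a single number.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: count how many pages of `path` are resident in
 * the page cache, using mmap() + mincore().
 * Returns the number of resident pages, or -1 on error.
 */
long count_resident_pages(const char *path)
{
	long page = sysconf(_SC_PAGESIZE);
	struct stat st;
	unsigned char *vec;
	void *map;
	size_t len, npages, i;
	long resident = 0;
	int fd;

	assert(page > 0);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	if (st.st_size == 0) {
		/* mmap() rejects zero-length mappings; nothing is cached. */
		close(fd);
		return 0;
	}
	len = (size_t)st.st_size;

	/* The mapping stays valid after the fd is closed. */
	map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);
	if (map == MAP_FAILED)
		return -1;

	/* mincore() writes one byte per page; userspace must aggregate. */
	npages = (len + (size_t)page - 1) / (size_t)page;
	vec = malloc(npages);
	if (!vec) {
		munmap(map, len);
		return -1;
	}

	if (mincore(map, len, vec) == 0) {
		for (i = 0; i < npages; i++)
			resident += vec[i] & 1;	/* bit 0: page is resident */
	} else {
		resident = -1;
	}

	free(vec);
	munmap(map, len);
	return resident;
}
```

For a large index the vector alone is one byte per page, and the map/unmap
cycle has to be repeated for every file in the set.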

Most recently, the database team at Meta reached out to us and asked
about the ability to query dirty state again. The motivation for this
was twofold. One was simply visibility into the writeback algorithm,
i.e. trying to figure out what it's doing when investigating
performance problems.

The second usecase they brought up was to advise writeback from
userspace to manage the tradeoff between integrity and IO utilization:
if IO capacity is available, sync more frequently; if not, let the
work batch up. Blindly syncing through the file in chunks doesn't work
because you don't know in advance how much IO each chunk will end up
doing (or how much it did, afterwards). So it's difficult to build an
algorithm that will reasonably pace through sparsely dirtied regions
without the risk of overwhelming the IO device on dense ones. And it's
not straightforward to do this from the kernel, since it doesn't know
the IO headroom the application needs for reading (which is dynamic).
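
To make the pacing problem concrete, here is a rough sketch of what such a
userspace algorithm could do once it can see dirty counts. This is not the
database team's code. It assumes the application has already collected a
dirty page count for each fixed-size chunk of the file (for example from the
nr_dirty field of the proposed struct cachestat); plan_sync_round() and its
budget_pages parameter are hypothetical names for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide how many consecutive chunks, starting at
 * index `start`, to sync in one round so that the number of dirty pages
 * written stays within `budget_pages`.
 *
 * nr_dirty[i] is the dirty page count of chunk i. At least one chunk is
 * always selected (if any remain) so that a single dense chunk larger
 * than the budget cannot stall progress.
 */
size_t plan_sync_round(const uint64_t *nr_dirty, size_t nchunks,
		       size_t start, uint64_t budget_pages)
{
	uint64_t total = 0;
	size_t n = 0;

	assert(nr_dirty != NULL || nchunks == 0);

	while (start + n < nchunks) {
		uint64_t d = nr_dirty[start + n];

		/* Stop before exceeding the budget (overflow-safe check). */
		if (n > 0 && (total > budget_pages || d > budget_pages - total))
			break;
		total += d;
		n++;
	}
	return n;
}
```

With counts of {0, 0, 3, 500, 2, 0} and a budget of 100 pages, the first
round covers the three sparse chunks, the second round takes the dense chunk
on its own, and the third covers the remainder. Without the dirty counts
there is no way to tell the sparse and dense chunks apart up front.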

The page cache is often the biggest memory consumer, and so the kernel
heuristics that manage it have a big impact on performance. We have a
rich interface to augment those heuristics with fadvise and the sync
family, but it's not a stretch to say that it's difficult to use them
if you cannot get good insights into what the other hand is doing.

Another query we get almost monthly is service owners trying to
understand where their memory is going and what's causing unexpected
pressure on a host. They see the cache in vmstat, but between a
complex application, shared libraries or a runtime (jvm, hhvm etc.)
and a myriad of host management agents, there is so much going on on
the machine that it's hard to find out who is touching which
files. When it comes to disk usage, the kernel provides the ability to
quickly stat entire filesystem subtrees and drill down with tools like
du. It sure would be useful to have the same for memory usage.

Our current cache interface is seriously lacking in that regard.

It would be great to have a stable, canonical and versatile interface
to inspect what the cache is doing. One that blends in with the
broader VFS and buffered IO interface: an easy to discover, easy to
use syscall (not an obscure tracepoint or fcntl or a drgn script); an
fd instead of a vma; a VFS-based permission model; efficient handling
of the wide range of file sizes that exist in the real world.

cachestat() fits that bill.

> >    struct cachestat {
> >            __u64 nr_cache;
> >            __u64 nr_dirty;
> >            __u64 nr_writeback;
> >            __u64 nr_evicted;
> >            __u64 nr_recently_evicted;
> >    };
> 
> And these fields are really getting into the weedy details of internal
> kernel implementation.  Bear in mind that we must support this API for
> ever.
> 
> Particularly the "evicted" things.  The workingset code was implemented
> eight years ago, which is actually relatively recent.  It could be that
> eight years from now it will have been removed and possibly replaced
> with something else.  Then what do we do?

;) I'm definitely biased here, but I don't think it's realistic that
we'd ever go back to a cache that doesn't maintain *some* form of
non-residency information.

We now have two reclaim implementations that rely on it at their
core. And psi is designed around the concept of initial faults vs
refaults; that's an ABI we have to maintain indefinitely anyway, and
is widely used for OOM killing and load shedding in datacenters, on
Android, by all systemd-based installations etc.

It seems unlikely that this is a fluke. But even if I'm completely
wrong about that, I think we have options that wouldn't spell the end
of the world. We could report 0 for those fields and be perfectly
backward compatible. There is a flags field that allows versioning of
struct cachestat, too.