Hi Andrew,

On Tue, Mar 14, 2023 at 04:00:41PM -0700, Andrew Morton wrote:
> On Tue, 7 Mar 2023 19:27:45 -0800 Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> > There is currently no good way to query the page cache state of large
> > file sets and directory trees. There is mincore(), but it scales poorly:
> > the kernel writes out a lot of bitmap data that userspace has to
> > aggregate, when the user really does not care about per-page information
> > in that case. The user also needs to mmap and unmap each file as it goes
> > along, which can be quite slow as well.
>
> A while ago I asked about the security implications - could cachestat()
> be used to figure out what parts of a file another user is reading.
> This also applies to mincore(), but cachestat() newly permits user A to
> work out which parts of a file user B has *written* to.

The caller of cachestat() must have the file open for reading. If they
can read the contents that B has written, is the fact that they can
see dirty state really a concern?

Nhat and I were discussing this offlist at the time, but weren't
creative enough to come up with an abuse scenario.

> I don't recall seeing a response to this, and there is no discussion in
> the changelogs.

It might have drowned in the noise, but he did reply:

https://lore.kernel.org/lkml/CAKEwX=Ppf=WbOuV2Rh3+V8ohOYXo=CnfSu9qqSh-DpVvfy2nhA@xxxxxxxxxxxxxx/

> Secondly, I'm not seeing description of any use cases. OK, it's faster
> and better than mincore(), but who cares? In other words, what
> end-user value compels us to add this feature to Linux?

Years ago there was a thread about adding dirty bits to mincore(). I
don't know if you remember this:

https://lkml.org/lkml/2013/2/10/162

In that thread, Rusty described a usecase of maintaining a journaling
file alongside a main file. The idea for testing the dirty state isn't
to call sync but to see whether the journal needs to be updated.

The efficiency of mincore() was touched on too.
Andres Freund (CC'd, hopefully I got the email address right) mentioned
that Postgres has a usecase for deciding whether to do an index scan or
query tables directly, based on whether the index is cached. Postgres
works with files rather than memory regions, and Andres mentioned that
the index could be quite large. The consensus was that having to go
through mmap(), and getting a bytemap representing each page when all
you need is a summary for the queried range, was too painful in
practice.

Most recently, the database team at Meta reached out to us and asked
about the ability to query dirty state again. The motivation for this
was twofold. One was simply visibility into the writeback algorithm,
i.e. trying to figure out what it's doing when investigating
performance problems.

The second usecase they brought up was to advise writeback from
userspace to manage the tradeoff between integrity and IO utilization:
if IO capacity is available, sync more frequently; if not, let the
work batch up. Blindly syncing through the file in chunks doesn't work
because you don't know in advance how much IO they'll end up doing (or
how much they've done, afterwards). So it's difficult to build an
algorithm that will reasonably pace through sparsely dirtied regions
without the risk of overwhelming the IO device on dense ones.

And it's not straightforward to do this from the kernel, since it
doesn't know the IO headroom the application needs for reading (which
is dynamic).

The page cache is often the biggest memory consumer, and so the kernel
heuristics that manage it have a big impact on performance. We have a
rich interface to augment those heuristics with fadvise and the sync
family, but it's not a stretch to say that it's difficult to use them
if you cannot get good insights into what the other hand is doing.

Another query we get almost monthly is service owners trying to
understand where their memory is going and what's causing unexpected
pressure on a host.
They see the cache in vmstat, but between a complex application,
shared libraries or a runtime (jvm, hhvm etc.) and a myriad of host
management agents, there is so much going on on the machine that it's
hard to find out who is touching which files.

When it comes to disk usage, the kernel provides the ability to
quickly stat entire filesystem subtrees and drill down with tools like
du. It sure would be useful to have the same for memory usage. Our
current cache interface is seriously lacking in that regard.

It would be great to have a stable, canonical and versatile interface
to inspect what the cache is doing. One that blends in with the
broader VFS and buffered IO interface: an easy to discover, easy to
use syscall (not an obscure tracepoint or fcntl or a drgn script); an
fd instead of a vma; a VFS-based permission model; efficient handling
of the wide range of file sizes that exist in the real world.

cachestat() fits that bill.

> > struct cachestat {
> >	__u64 nr_cache;
> >	__u64 nr_dirty;
> >	__u64 nr_writeback;
> >	__u64 nr_evicted;
> >	__u64 nr_recently_evicted;
> > };
>
> And these fields are really getting into the weedy details of internal
> kernel implementation. Bear in mind that we must support this API for
> ever.
>
> Particularly the "evicted" things. The workingset code was implemented
> eight years ago, which is actually relatively recent. It could be that
> eight years from now it will have been removed and possibly replaced
> workingset with something else. Then what do we do? ;)

I'm definitely biased here, but I don't think it's realistic that we'd
ever go back to a cache that doesn't maintain *some* form of
non-residency information. We now have two reclaim implementations
that rely on it at their core. And psi is designed around the concept
of initial faults vs refaults; that's an ABI we have to maintain
indefinitely anyway, and is widely used for OOM killing and load
shedding in datacenters, on Android, by all systemd-based
installations etc.
It seems unlikely that this is a fluke.

But even if I'm completely wrong about that, I think we have options
that wouldn't spell the end of the world. We could report 0 for those
fields and be perfectly backward compatible. There is a flags field
that allows versioning of struct cachestat, too.
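
For reference, the mincore() pattern that the quoted changelog at the
top describes can be sketched roughly as follows. This is only an
illustration: the helper name count_resident_pages() is hypothetical
and does not come from the patch series, and it maps the whole file in
one go for simplicity.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the cachestat series): count how many
 * pages of the file at @path are resident, using mmap() + mincore().
 * Stores the file's total page count in @total_pages.
 * Returns the number of resident pages, or -1 on error.
 */
static long count_resident_pages(const char *path, long *total_pages)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned char *vec;
	long resident = 0;
	struct stat st;
	size_t npages, i;
	void *map;
	int fd;

	*total_pages = 0;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}

	npages = ((size_t)st.st_size + page_size - 1) / page_size;
	*total_pages = (long)npages;
	if (npages == 0) {
		/* mmap() rejects zero-length mappings; nothing to count. */
		close(fd);
		return 0;
	}

	/* Each file has to be mapped just to be able to ask about it. */
	map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);
	if (map == MAP_FAILED)
		return -1;

	/* One byte per page, written by the kernel. */
	vec = malloc(npages);
	if (!vec) {
		munmap(map, st.st_size);
		return -1;
	}

	if (mincore(map, st.st_size, vec) == 0) {
		/* Userspace aggregates the per-page data into one number. */
		for (i = 0; i < npages; i++)
			resident += vec[i] & 1;
	} else {
		resident = -1;
	}

	free(vec);
	munmap(map, st.st_size);
	return resident;
}
```

Walking a directory tree means repeating all of this for every file.
Note also that the vector only describes residency; it carries no
dirty or writeback information.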