On Thu, May 4, 2023 at 10:26 AM Geert Uytterhoeven <geert@xxxxxxxxxxxxxx> wrote:
>
> Hi Nhat,
>
> On Wed, May 3, 2023 at 3:38 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> > There is currently no good way to query the page cache state of large
> > file sets and directory trees. There is mincore(), but it scales poorly:
> > the kernel writes out a lot of bitmap data that userspace has to
> > aggregate, when the user really does not care about per-page
> > information in that case. The user also needs to mmap and unmap each
> > file as it goes along, which can be quite slow as well.
> >
> > Some use cases where this information could come in handy:
> >   * Allowing a database to decide whether to perform an index scan or
> >     direct table queries based on the in-memory cache state of the
> >     index.
> >   * Visibility into the writeback algorithm, for diagnosing
> >     performance issues.
> >   * Workload-aware writeback pacing: estimating IO fulfilled by page
> >     cache (and IO to be done) within a range of a file, allowing for
> >     more frequent syncing when and where there is IO capacity, and
> >     batching when there is not.
> >   * Computing memory usage of large files/directory trees, analogous to
> >     the du tool for disk usage.
> >
> > More information about these use cases can be found in the following
> > thread:
> >
> > https://lore.kernel.org/lkml/20230315170934.GA97793@xxxxxxxxxxx/
> >
> > This patch implements a new syscall that queries cache state of a file
> > and summarizes the number of cached pages, number of dirty pages, number
> > of pages marked for writeback, number of (recently) evicted pages, etc.
> > in a given range. Currently, the syscall is only wired in for the x86
> > architecture.
> >
> > NAME
> >     cachestat - query the page cache statistics of a file.
> >
> > SYNOPSIS
> >     #include <sys/mman.h>
> >
> >     struct cachestat_range {
> >         __u64 off;
> >         __u64 len;
> >     };
> >
> >     struct cachestat {
> >         __u64 nr_cache;
> >         __u64 nr_dirty;
> >         __u64 nr_writeback;
> >         __u64 nr_evicted;
> >         __u64 nr_recently_evicted;
> >     };
> >
> >     int cachestat(unsigned int fd, struct cachestat_range *cstat_range,
> >         struct cachestat *cstat, unsigned int flags);
> >
> > DESCRIPTION
> >     cachestat() queries the number of cached pages, number of dirty
> >     pages, number of pages marked for writeback, number of evicted
> >     pages, number of recently evicted pages, in the bytes range given by
> >     `off` and `len`.
> >
> >     An evicted page is a page that was previously in the page cache but
> >     has been evicted since. A page is recently evicted if its last
> >     eviction was recent enough that its reentry to the cache would
> >     indicate that it is actively being used by the system, and that
> >     there is memory pressure on the system.
> >
> >     These values are returned in a cachestat struct, whose address is
> >     given by the `cstat` argument.
> >
> >     The `off` and `len` arguments must be non-negative integers. If
> >     `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
> >     0, we will query in the range from `off` to the end of the file.
> >
> >     The `flags` argument is unused for now, but is included for future
> >     extensibility. Users should pass 0 (i.e. no flag specified).
> >
> >     Currently, hugetlbfs is not supported.
> >
> >     Because the status of a page can change after cachestat() checks it
> >     but before it returns to the application, the returned values may
> >     contain stale information.
> >
> > RETURN VALUE
> >     On success, cachestat returns 0. On error, -1 is returned, and errno
> >     is set to indicate the error.
> >
> > ERRORS
> >     EFAULT cstat or cstat_range points to an invalid address.
> >
> >     EINVAL invalid flags.
> >
> >     EBADF invalid file descriptor.
> >
> >     EOPNOTSUPP file descriptor is of a hugetlbfs file
> >
> > Signed-off-by: Nhat Pham <nphamcs@xxxxxxxxx>
> > ---
> >  arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl | 1 +
>
> This should be wired up on each and every architecture.
> Currently we're getting
>
>     <stdin>:1567:2: warning: #warning syscall cachestat not implemented [-Wcpp]
>
> in linux-next for all the missing architectures.

Hi Geert,

I saw that there are several instances where we have separate patches to
wire up a syscall to these architectures, so I was doing something similar.
For example:

ARM: wire up process_vm_writev and process_vm_readv syscalls
(e5489847d6fc0ff176048b6e1cf5034507bf703a)

MIPS: Hook up process_vm_readv and process_vm_writev system calls.
(8ff8584e51d4d3fbe08ede413c4a221223766323)

As for these non-x86 architecture wiring patches, I can give it a shot and
cross-compile to see if it builds, but I have limited abilities for runtime
tests as I don't have machines with these architectures.

I would really appreciate it if there are arch people that could help wire
it up. (cc-ing linux-arch as well)

>
> Gr{oetje,eeting}s,
>
>                         Geert
>
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@xxxxxxxxxxxxxx
>
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
>                                 -- Linus Torvalds
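
For anyone who wants to exercise the interface quoted above without a libc
wrapper, here is a minimal sketch that goes through syscall(2). It is an
illustration under assumptions, not part of the patch: the struct and
function names ending in `_sketch` are made up for this example, the struct
layouts are copied from the SYNOPSIS, and the fallback syscall number 451 is
assumed (it is only used if the installed headers do not define
`__NR_cachestat`).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for the syscall() declaration */
#endif
#include <assert.h>
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* ASSUMPTION: fallback number, used only when the headers lack it. */
#ifndef __NR_cachestat
#define __NR_cachestat 451
#endif

/* Local mirrors of the structs in the SYNOPSIS (hypothetical names). */
struct cachestat_range_sketch {
    uint64_t off;
    uint64_t len;              /* 0 queries from off to the end of the file */
};

struct cachestat_sketch {
    uint64_t nr_cache;
    uint64_t nr_dirty;
    uint64_t nr_writeback;
    uint64_t nr_evicted;
    uint64_t nr_recently_evicted;
};

/* Hypothetical helper: returns 0 on success, -1 with errno set on error
 * (for example ENOSYS when the running kernel lacks the syscall). */
int cachestat_sketch_query(int fd, uint64_t off, uint64_t len,
                           struct cachestat_sketch *out)
{
    struct cachestat_range_sketch range = { .off = off, .len = len };

    assert(out != NULL);
    /* flags is passed as 0, as the DESCRIPTION asks. */
    return (int)syscall(__NR_cachestat, (long)fd, &range, out, 0L);
}

/* Hypothetical helper: query the whole file behind fd and print the counters. */
int cachestat_sketch_print(int fd)
{
    struct cachestat_sketch cs;

    memset(&cs, 0, sizeof(cs));
    if (cachestat_sketch_query(fd, 0, 0, &cs) != 0) {
        fprintf(stderr, "cachestat: %s\n", strerror(errno));
        return -1;
    }
    printf("cached=%" PRIu64 " dirty=%" PRIu64 " writeback=%" PRIu64
           " evicted=%" PRIu64 " recently_evicted=%" PRIu64 "\n",
           cs.nr_cache, cs.nr_dirty, cs.nr_writeback,
           cs.nr_evicted, cs.nr_recently_evicted);
    return 0;
}
```

Calling cachestat_sketch_print(fd) on a descriptor obtained from open(2)
prints one line of counters for the whole file. On a kernel or architecture
where the syscall is not wired up, the call fails and the helper reports the
errno value instead.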