On Mon, Feb 27, 2023 at 7:42 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> From time to time, new syscalls have been proposed to gain more observability
> for file systems:
>
> (1) getvalues() [0]. It uses a hierarchical namespace API to gather and return
> multiple values in a single syscall.
> (2) cachestat() [1]. It returns the cache status (e.g., number of dirty pages)
> of a given file in a scalable way.
>
> All these proposals require adding a new syscall. Here I would like to propose
> another solution for file system observability: bpf iterator for file system
> object. The initial idea came when I was trying to implement a filefrag-like
> page cache tool with support for multi-order folio, so that we can know the
> number of multi-order folios and the orders of those folios in page cache. After
> developing a demo for it, I realized that we could use it to provide more
> observability for file system objects, e.g., dumping the per-cpu iostat for a
> super block [2], iterating all inodes in a super-block to dump info for
> specific inodes (e.g., unlinked but pinned inode), or displaying the flags of a
> specific mount.
>
> The BPF iterator was introduced in v5.8 [3] to support flexible content dumping
> for kernel objects. It works by creating a bpf iterator file [4], which is a
> seq-like read-only file, and the content of the bpf iterator file is determined
> by a previously loaded bpf program, so userspace can read the bpf iterator file
> to get the information it needs. However, there are some unresolved issues:
> (1) The privilege.
> Loading the bpf program requires CAP_SYS_ADMIN or CAP_BPF. This means that the
> observability will be available only to privileged processes. Maybe we can load
> the bpf program through a privileged process and make the bpf iterator file
> readable by normal users.

That's possible today. Once you load a BPF iter program and pin it in BPF FS,
you can chown/chmod the pinned file to give unprivileged processes access to it.
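
To make the chown/chmod step concrete, here is a minimal sketch in C. It
assumes a privileged process has already loaded the iterator program and pinned
it (for example with `bpftool iter pin`). The helper name grant_iter_access()
and the path used in the note below are made up for illustration. Only the
plain POSIX chown(2) and chmod(2) calls are used.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: change ownership and mode of an already-pinned
 * BPF iterator file so that other users can read it.
 *
 * Pass (uid_t)-1 or (gid_t)-1 to leave the owner or group unchanged.
 * Returns 0 on success, -1 on failure (errno is left as set by the
 * failing call).
 */
int grant_iter_access(const char *pin_path, uid_t owner, gid_t group,
                      mode_t mode)
{
    if (chown(pin_path, owner, group) != 0) {
        perror("chown");
        return -1;
    }
    if (chmod(pin_path, mode) != 0) {
        perror("chmod");
        return -1;
    }
    return 0;
}
```

With a pin at, say, /sys/fs/bpf/fs_iter (an assumed path), the privileged
loader would call grant_iter_access("/sys/fs/bpf/fs_iter", (uid_t)-1,
(gid_t)-1, 0444), after which an unprivileged user can read the file, provided
the directories leading to it are searchable by that user.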

> (2) Prevent pinning the super-block
> In the current naive implementation, the bpf iterator simply pins the
> super-block of the passed fd and prevents the super-block from being destroyed.
> Perhaps fs-pin is a better choice, so the bpf iterator can be deactivated after
> the filesystem is umounted.
>
> I hope to send out an RFC soon before LSF/MM/BPF for further discussion.
>
> [0]:
> https://lore.kernel.org/linux-fsdevel/YnEeuw6fd1A8usjj@xxxxxxxxxxxxxxxxxxxxxxxxx/
> [1]: https://lore.kernel.org/linux-mm/20230219073318.366189-1-nphamcs@xxxxxxxxx/
> [2]:
> https://lore.kernel.org/linux-fsdevel/CAJfpegsCKEx41KA1S2QJ9gX9BEBG4_d8igA0DT66GFH2ZanspA@xxxxxxxxxxxxxx/
> [3]: https://lore.kernel.org/bpf/20200509175859.2474608-1-yhs@xxxxxx/
> [4]: https://docs.kernel.org/bpf/bpf_iterators.html
>