Re: [Lsf-pc] [LSF/MM/BPF TOPIC] bpf iterator for file-system

Amir Goldstein <amir73il@xxxxxxxxx> · Thu, 27 Apr 2023 18:54:36 +0300

On Mon, Apr 24, 2023 at 9:45 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 4/16/2023 3:55 PM, Amir Goldstein wrote:
> > On Tue, Feb 28, 2023 at 5:47 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >> From time to time, new syscalls have been proposed to gain more observability
> >> for file systems:
> >>
> >> (1) getvalues() [0]. It uses a hierarchical namespace API to gather and return
> >> multiple values in a single syscall.
> >> (2) cachestat() [1]. It returns the cache status (e.g., number of dirty pages)
> >> of a given file in a scalable way.
> >>
> >> All these proposals require adding a new syscall. Here I would like to propose
> >> another solution for file system observability: bpf iterator for file system
> >> object. The initial idea came when I was trying to implement a filefrag-like
> >> page cache tool with support for multi-order folios, so that we can know the
> >> number of multi-order folios and the orders of those folios in the page cache.
> >> After developing a demo for it, I realized that we could use it to provide more
> >> observability for file system objects, e.g., dumping the per-cpu iostat for a
> >> super block [2], iterating all inodes in a super-block to dump info for
> >> specific inodes (e.g., unlinked but pinned inodes), or displaying the flags of a
> >> specific mount.
> >>
> >> The BPF iterator was introduced in v5.8 [3] to support flexible content dumping
> >> for kernel objects. It works by creating a bpf iterator file [4], which is a
> >> seq-like read-only file, and the content of the bpf iterator file is determined
> >> by a previously loaded bpf program, so userspace can read the bpf iterator file
> >> to get the information it needs. However, there are some unresolved issues:
> >> (1) The privilege.
> >> Loading the bpf program requires CAP_SYS_ADMIN or CAP_BPF. This means that the
> >> observability will be available only to privileged processes. Maybe we can load
> >> the bpf program through a privileged process and make the bpf iterator file
> >> readable by normal users.
> >> (2) Prevent pinning the super-block
> >> In the current naive implementation, the bpf iterator simply pins the
> >> super-block of the passed fd and prevents the super-block from being destroyed.
> >> Perhaps fs-pin is a better choice, so the bpf iterator can be deactivated after
> >> the filesystem is umounted.
> >>
> >> I hope to send out an RFC soon before LSF/MM/BPF for further discussion.
> > Hi Hou,
> >
> > IIUC, there is not much value in making this a cross track session.
> > Seems like an FS track session that has not much to do with BPF
> > development.
> >
> > Am I understanding correctly or are there any cross subsystem
> > interactions that need to be discussed?
> Yes. Although the patchset for the file-system iterator is not ready yet, I
> think the BPF mechanisms for the file-system iterator are ready, so a cross-track
> session may be unnecessary.
> >
> > Perhaps we can join you as co-speaker for Miklos' traditional
> > "fsinfo" session?
> Thanks. I am glad to be a co-speaker for the fsinfo session.

All right. I put you down as a co-speaker with Miklos on the fsinfo session.

Thanks,
Amir.