Re: [Lsf-pc] [LSF/MM/BPF TOPIC] bpf iterator for file-system

Hou Tao <houtao@xxxxxxxxxxxxxxx> · Mon, 24 Apr 2023 14:45:33 +0800



Hi,

On 4/16/2023 3:55 PM, Amir Goldstein wrote:
> On Tue, Feb 28, 2023 at 5:47 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>> From time to time, new syscalls have been proposed to improve file-system
>> observability:
>>
>> (1) getvalues() [0]. It uses a hierarchical namespace API to gather and return
>> multiple values in a single syscall.
>> (2) cachestat() [1]. It returns the cache status (e.g., number of dirty pages)
>> of a given file in a scalable way.
>>
>> All these proposals require adding a new syscall. Here I would like to
>> propose another solution for file-system observability: a bpf iterator for
>> file-system objects. The initial idea came when I was trying to implement a
>> filefrag-like page cache tool with support for multi-order folios, so that
>> we can know the number of multi-order folios and the orders of those folios
>> in the page cache. After developing a demo for it, I realized that we could
>> use it to provide more observability for file-system objects, e.g., dumping
>> the per-cpu iostat for a super-block [2], iterating all inodes in a
>> super-block to dump info for specific inodes (e.g., unlinked but pinned
>> inodes), or displaying the flags of a specific mount.
>>
>> The BPF iterator was introduced in v5.8 [3] to support flexible content
>> dumping for kernel objects. It works by creating a bpf iterator file [4],
>> which is a seq-like read-only file whose content is determined by a
>> previously loaded bpf program, so userspace can read the bpf iterator file
>> to get the information it needs. However, there are some unresolved issues:
>> (1) The privilege.
>> Loading the bpf program requires CAP_SYS_ADMIN or CAP_BPF. This means that
>> the observability will only be available to privileged processes. Maybe we
>> can load the bpf program through a privileged process and make the bpf
>> iterator file readable for normal users.
>> (2) Prevent pinning the super-block
>> In the current naive implementation, the bpf iterator simply pins the
>> super-block of the passed fd and prevents the super-block from being destroyed.
>> Perhaps fs-pin is a better choice, so the bpf iterator can be deactivated after
>> the filesystem is unmounted.
>>
>> I hope to send out an RFC soon before LSF/MM/BPF for further discussion.
> Hi Hou,
>
> IIUC, there is not much value in making this a cross track session.
> Seems like an FS track session that has not much to do with BPF
> development.
>
> Am I understanding correctly or are there any cross subsystem
> interactions that need to be discussed?
Yes. Although the patchset for the file-system iterator is not ready yet, I
think the BPF mechanisms it relies on are ready, so a cross-track session may
be unnecessary.
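
For reference, consuming a bpf iterator from userspace needs nothing beyond
plain file reads, because a pinned bpf iterator file behaves like a seq-like
read-only file. Below is a minimal sketch in C, not code from the patchset.
The helper name read_iter_file() and the pin path /sys/fs/bpf/fs_inode_iter
mentioned afterwards are hypothetical, and the sketch assumes a privileged
process has already loaded the bpf program and pinned the iterator.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Read a whole seq-like file (e.g. a pinned bpf iterator file) into a
 * malloc'ed, NUL-terminated buffer. Returns NULL on error. On success
 * the caller frees the buffer; *len_out (if non-NULL) gets the length.
 * read_iter_file() is a hypothetical helper, not a libbpf API.
 */
static char *read_iter_file(const char *path, size_t *len_out)
{
	size_t cap = 4096, len = 0;
	char *buf;
	int fd;

	assert(path != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return NULL;

	buf = malloc(cap);
	if (!buf) {
		close(fd);
		return NULL;
	}

	for (;;) {
		ssize_t n;

		/* Grow the buffer, keeping one spare byte for the NUL. */
		if (len + 1 >= cap) {
			char *tmp = realloc(buf, cap * 2);

			if (!tmp) {
				free(buf);
				close(fd);
				return NULL;
			}
			buf = tmp;
			cap *= 2;
		}

		n = read(fd, buf + len, cap - len - 1);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			free(buf);
			close(fd);
			return NULL;
		}
		if (n == 0)	/* EOF: the iterator has finished its dump */
			break;
		len += (size_t)n;
	}

	close(fd);
	buf[len] = '\0';
	if (len_out)
		*len_out = len;
	return buf;
}
```

A tool would call read_iter_file("/sys/fs/bpf/fs_inode_iter", &len) (path
hypothetical) and parse whatever text format the bpf program emits. The
reader itself needs no BPF-specific privilege, only read permission on the
pinned file, which is why loading via a privileged process looks workable.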
>
> Perhaps we can join you as co-speaker for Miklos' traditional
> "fsinfo" session?
Thanks. I would be glad to be a co-speaker for the fsinfo session.
>
> Thanks,
> Amir.
>
>> [0]:
>> https://lore.kernel.org/linux-fsdevel/YnEeuw6fd1A8usjj@xxxxxxxxxxxxxxxxxxxxxxxxx/
>> [1]: https://lore.kernel.org/linux-mm/20230219073318.366189-1-nphamcs@xxxxxxxxx/
>> [2]:
>> https://lore.kernel.org/linux-fsdevel/CAJfpegsCKEx41KA1S2QJ9gX9BEBG4_d8igA0DT66GFH2ZanspA@xxxxxxxxxxxxxx/
>> [3]: https://lore.kernel.org/bpf/20200509175859.2474608-1-yhs@xxxxxx/
>> [4]: https://docs.kernel.org/bpf/bpf_iterators.html