Re: [PATCH RFC bpf-next v1 0/8] Pinning bpf objects outside bpffs

Hao Luo <haoluo@xxxxxxxxxx> · Fri, 7 Jan 2022 10:59:07 -0800

On Thu, Jan 6, 2022 at 3:03 PM <sdf@xxxxxxxxxx> wrote:
>
> On 01/06, Hao Luo wrote:
> > Bpffs is a pseudo file system that persists bpf objects. Previously,
> > bpf objects could only be pinned in bpffs; this patchset extends pinning
> > to allow bpf objects to be pinned in (or exposed to) other file systems.
>
> > In particular, this patchset allows pinning bpf objects in kernfs. This
> > creates a new file entry in the kernfs file system and the created file
> > is able to reference the bpf object. By doing so, bpf can be used to
> > customize the file's operations, such as seq_show.
>
> > As a concrete use case of this feature, this patchset introduces a
> > simple new program type called 'bpf_view', which can be used to format
> > a seq file from a kernel object's state. By pinning a bpf_view program
> > into a cgroup directory, userspace is able to read the cgroup's state
> > from a file in a format defined by the bpf program.
>
> > Unlike bpffs, kernfs doesn't have a callback when a kernfs node
> > is freed, which is a problem if we allow the kernfs node to hold an extra
> > reference to the bpf object, because there is no chance to dec the
> > object's refcnt. Therefore the kernfs node created by pinning doesn't
> > hold a reference to the bpf object. The lifetime of the kernfs node
> > depends on the lifetime of the bpf object. Rather than "pinning in
> > kernfs", it is "exposing to kernfs". We require the bpf object to be
> > pinned in bpffs first before it can be pinned in kernfs. When the
> > object is unpinned from bpffs, its kernfs nodes will be removed
> > automatically. This in a sense treats a pinned bpf object as a
> > persistent "device".
>
> > We rely on fsnotify to monitor the inode events in bpffs. A new function
> > bpf_watch_inode() is introduced. It allows registering a callback
> > function at inode destruction. For the kernfs case, a callback that
> > removes the kernfs node is registered at the destruction of bpffs inodes.
> > For other file systems such as sockfs, bpf_watch_inode() can monitor the
> > destruction of sockfs inodes and the created file entry can hold the bpf
> > object's reference. In this case, it is truly "pinning".
>
> > File operations other than seq_show can also be implemented using bpf.
> > For example, bpf may be of help for .poll and .mmap in kernfs.
>
> This looks awesome!
>
> One thing I don't understand is: why did you go through the pinning
> interface vs. regular attach/detach? IOW, why not allow regular
> sys_bpf(BPF_PROG_ATTACH, prog_id, cgroup_id) and attach to the cgroup
> (which, in turn, creates the kernfs nodes). Seems like this way you can drop
> the requirement on the object being pinned in the bpffs first?

Thanks Stan.

Yeah, the attach/detach approach is definitely another option. IIUC,
in comparison to pinning, attach/detach would only work for cgroups,
right? Pinning can also be used on other file systems, such as sockfs,
sysfs or resctrl. But I don't know whether this generality is welcome,
and implementing seq_show is the only concrete use case I can think of
right now. If people think the ability to create files in other
subsystems is not a good idea, I'd be happy to take a look at the
attach/detach approach; that may be the right way.
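
For reference, the lifetime rule from the cover letter (the kernfs node
borrows the object, the bpffs pin owns the reference, and unpinning
tears the kernfs node down through a destruction callback) can be
modeled in plain user-space C. This is only a sketch under stated
assumptions: every type and function name below is hypothetical and
invented for illustration; none of it is the patchset's code or a kernel
API, and watch_inode() merely stands in for the idea behind
bpf_watch_inode().

```c
#include <assert.h>
#include <stddef.h>

/* All names here are hypothetical stand-ins, not kernel interfaces. */
struct bpf_obj {
	int refcnt;
};

struct kernfs_entry {
	const char *path;
	struct bpf_obj *obj;	/* borrowed pointer: no reference is taken */
	int alive;
};

struct bpffs_inode {
	struct bpf_obj *obj;		/* the pin owns one reference */
	void (*on_destroy)(void *data);	/* run when the inode goes away */
	void *data;
};

/* Pinning in bpffs takes a reference on the object. */
void pin_in_bpffs(struct bpffs_inode *ino, struct bpf_obj *obj)
{
	obj->refcnt++;
	ino->obj = obj;
	ino->on_destroy = NULL;
	ino->data = NULL;
}

/* Models the idea of registering a callback for inode destruction. */
void watch_inode(struct bpffs_inode *ino, void (*cb)(void *), void *data)
{
	ino->on_destroy = cb;
	ino->data = data;
}

/* Callback: drop the kernfs entry so it never outlives the pin. */
void remove_kernfs_entry(void *data)
{
	struct kernfs_entry *e = data;

	e->alive = 0;
	e->obj = NULL;
}

/* "Exposing to kernfs": requires an existing bpffs pin, takes no ref. */
int expose_to_kernfs(struct bpffs_inode *ino, struct kernfs_entry *e,
		     const char *path)
{
	if (!ino->obj)
		return -1;	/* object must be pinned in bpffs first */
	e->path = path;
	e->obj = ino->obj;
	e->alive = 1;
	watch_inode(ino, remove_kernfs_entry, e);
	return 0;
}

/* Unpinning runs the callback first, then drops the pin's reference. */
void unpin_from_bpffs(struct bpffs_inode *ino)
{
	assert(ino->obj != NULL);
	if (ino->on_destroy)
		ino->on_destroy(ino->data);
	ino->obj->refcnt--;
	ino->obj = NULL;
}
```

In this model, expose_to_kernfs() fails until pin_in_bpffs() has run,
the refcnt stays at 1 after exposing, and unpin_from_bpffs() leaves the
entry marked dead with the refcnt back at 0. For the sockfs-style "truly
pinning" case described in the cover letter, the entry would take its
own reference and drop it from a callback on its own inode instead.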