On Fri, Jan 7, 2022 at 11:25 AM <sdf@xxxxxxxxxx> wrote:
>
> On 01/07, Hao Luo wrote:
> > On Thu, Jan 6, 2022 at 3:03 PM <sdf@xxxxxxxxxx> wrote:
> > >
> > > On 01/06, Hao Luo wrote:
> > > > Bpffs is a pseudo file system that persists bpf objects. Previously,
> > > > bpf objects could only be pinned in bpffs; this patchset extends
> > > > pinning to allow bpf objects to be pinned (or exposed) to other
> > > > file systems.
> > > >
> > > > In particular, this patchset allows pinning bpf objects in kernfs.
> > > > This creates a new file entry in the kernfs file system, and the
> > > > created file is able to reference the bpf object. By doing so, bpf
> > > > can be used to customize the file's operations, such as seq_show.
> > > >
> > > > As a concrete use case of this feature, this patchset introduces a
> > > > simple new program type called 'bpf_view', which can be used to
> > > > format a seq file from a kernel object's state. By pinning a
> > > > bpf_view program into a cgroup directory, userspace is able to
> > > > read the cgroup's state from the file in a format defined by the
> > > > bpf program.
> > > >
> > > > Different from bpffs, kernfs doesn't have a callback when a kernfs
> > > > node is freed, which is a problem if we allow the kernfs node to
> > > > hold an extra reference on the bpf object, because there is no
> > > > chance to decrement the object's refcnt. Therefore the kernfs node
> > > > created by pinning doesn't hold a reference on the bpf object; the
> > > > lifetime of the kernfs node depends on the lifetime of the bpf
> > > > object. Rather than "pinning in kernfs", it is "exposing to
> > > > kernfs". We require the bpf object to be pinned in bpffs first
> > > > before it can be pinned in kernfs. When the object is unpinned
> > > > from bpffs, its kernfs nodes will be removed automatically. This
> > > > effectively treats a pinned bpf object as a persistent "device".
> > > >
> > > > We rely on fsnotify to monitor the inode events in bpffs. A new
> > > > function bpf_watch_inode() is introduced. It allows registering a
> > > > callback function to run at inode destruction. For the kernfs
> > > > case, a callback that removes the kernfs node is registered at the
> > > > destruction of bpffs inodes. For other file systems such as
> > > > sockfs, bpf_watch_inode() can monitor the destruction of sockfs
> > > > inodes, and the created file entry can hold the bpf object's
> > > > reference. In this case, it is truly "pinning".
> > > >
> > > > File operations other than seq_show can also be implemented using
> > > > bpf. For example, bpf may be of help for .poll and .mmap in
> > > > kernfs.
> > >
> > > This looks awesome!
> > >
> > > One thing I don't understand is: why did you go through the pinning
> > > interface vs regular attach/detach? IOW, why not allow a regular
> > > sys_bpf(BPF_PROG_ATTACH, prog_id, cgroup_id) and attach to the
> > > cgroup (which, in turn, creates the kernfs nodes)? Seems like this
> > > way you can drop the requirement that the object be pinned in bpffs
> > > first?
> >
> > Thanks Stan.
> >
> > Yeah, the attach/detach approach is definitely another option. IIUC,
> > in comparison to pinning, does attach/detach only work for cgroups?
>
> attach has a target_fd argument that, in theory, can be whatever. We
> can add support for different fd types.

I see. With the attach API, are we also able to specify some attributes
for the attachment? For example, a property that we may want is: let
descendant cgroups inherit their parent cgroup's programs.
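For comparison, the existing cgroup-bpf attach path already encodes one
such property in its flags (BPF_F_ALLOW_OVERRIDE / BPF_F_ALLOW_MULTI):
with BPF_F_ALLOW_MULTI, a program attached to a parent cgroup stays
effective in its descendants. A minimal userspace sketch against
today's libbpf, with an arbitrary existing attach type standing in:

  #include <errno.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <bpf/bpf.h>

  /* prog_fd is the fd of an already loaded bpf program. */
  static int attach_with_inherit(int prog_fd, const char *cgroup_path)
  {
          int cg_fd, err;

          cg_fd = open(cgroup_path, O_RDONLY);
          if (cg_fd < 0)
                  return -errno;

          /* BPF_F_ALLOW_MULTI: the program stays effective in
           * descendant cgroups, and descendants may still attach
           * their own programs.
           */
          err = bpf_prog_attach(prog_fd, cg_fd,
                                BPF_CGROUP_INET_INGRESS,
                                BPF_F_ALLOW_MULTI);
          close(cg_fd);
          return err;
  }

Maybe a similar flag on the new attach type could express the
inheritance above.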
> > Pinning may be used on other file systems: sockfs, sysfs or
> > resctrl. But I don't know whether this generality is welcome, and
> > implementing seq_show is the only concrete use case I can think of
> > right now. If people think the ability to create files in other
> > subsystems is not good, I'd be happy to take a look at the
> > attach/detach approach, and that may be the right way.
>
> The reason I started thinking about attach/detach is because of the
> clunky unlink that you have to do (aka echo "rm" > file). IMO, having
> a standard attach/detach is much clearer. But I might be missing some
> complexity associated with non-cgroup filesystems.

Oh, I see. Looks good. Let me play with it before sending the next
version.
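Roughly the flow I'd like to prototype, sketched with today's libbpf
calls. Note that BPF_CGROUP_VIEW below is a placeholder name I'm making
up for this sketch, not an existing attach type:

  #include <errno.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <bpf/bpf.h>
  #include <linux/bpf.h>

  /* Placeholder: the real attach type doesn't exist yet. */
  #define BPF_CGROUP_VIEW ((enum bpf_attach_type)__MAX_BPF_ATTACH_TYPE)

  /* Attach creates the kernfs node under the cgroup directory;
   * detach would remove it. No bpffs pin, no echo "rm" > file.
   */
  static int expose_view(int prog_fd, const char *cgroup_path)
  {
          int cg_fd, err;

          cg_fd = open(cgroup_path, O_RDONLY);
          if (cg_fd < 0)
                  return -errno;

          err = bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_VIEW, 0);
          close(cg_fd);
          return err;
  }

Tearing down would then be bpf_prog_detach2(prog_fd, cg_fd,
BPF_CGROUP_VIEW) instead of the "rm" write.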