On 01/06, Hao Luo wrote:
Bpffs is a pseudo file system that persists bpf objects. Previously, bpf objects could only be pinned in bpffs; this patchset extends pinning to allow bpf objects to be pinned in (or exposed to) other file systems.
In particular, this patchset allows pinning bpf objects in kernfs. This creates a new file entry in the kernfs file system and the created file is able to reference the bpf object. By doing so, bpf can be used to customize the file's operations, such as seq_show.
As a concrete use case of this feature, this patchset introduces a simple new program type called 'bpf_view', which can be used to format a seq file from a kernel object's state. By pinning a bpf_view program into a cgroup directory, userspace is able to read the cgroup's state from a file in a format defined by the bpf program.
Unlike bpffs, kernfs doesn't have a callback when a kernfs node is freed, which is a problem if we allow the kernfs node to hold an extra reference to the bpf object, because there is no chance to decrement the object's refcnt. Therefore the kernfs node created by pinning doesn't hold a reference to the bpf object. The lifetime of the kernfs node depends on the lifetime of the bpf object. Rather than "pinning in kernfs", it is "exposing to kernfs". We require the bpf object to be pinned in bpffs first before it can be pinned in kernfs. When the object is unpinned from bpffs, its kernfs nodes will be removed automatically. In a sense, this treats a pinned bpf object as a persistent "device".
We rely on fsnotify to monitor the inode events in bpffs. A new function bpf_watch_inode() is introduced. It allows registering a callback function at inode destruction. For the kernfs case, a callback that removes the kernfs node is registered at the destruction of bpffs inodes. For other file systems such as sockfs, bpf_watch_inode() can monitor the destruction of sockfs inodes and the created file entry can hold the bpf object's reference. In this case, it is truly "pinning".
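To make the lifetime rule concrete, here is a minimal userspace sketch in plain C. It is not kernel code and not the patchset's implementation: every type and function name below (model_inode, model_kernfs_node, model_watch_inode, and so on) is hypothetical, and model_watch_inode() only stands in for the role described for bpf_watch_inode(). It also assumes a single watcher per inode for brevity. The point it illustrates is that the kernfs entry holds no reference of its own; it is removed by a callback when the bpffs inode goes away.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these are kernel APIs. */
struct model_inode;
typedef void (*destroy_cb)(struct model_inode *inode, void *data);

struct model_inode {
	destroy_cb on_destroy;	/* callback run when the inode is destroyed */
	void *cb_data;		/* opaque argument passed to the callback */
	bool alive;		/* true while the object is pinned in "bpffs" */
};

struct model_kernfs_node {
	bool present;		/* true while the exposed file entry exists */
};

/* Stand-in for the role of bpf_watch_inode(): register a destruction
 * callback. Fails if the inode is gone or already has a watcher. */
int model_watch_inode(struct model_inode *inode, destroy_cb cb, void *data)
{
	if (!inode || !inode->alive || inode->on_destroy)
		return -1;
	inode->on_destroy = cb;
	inode->cb_data = data;
	return 0;
}

/* Destruction callback for the kernfs case: drop the file entry. */
void model_remove_kernfs_node(struct model_inode *inode, void *data)
{
	struct model_kernfs_node *kn = data;

	(void)inode;
	kn->present = false;
}

/* "Expose" a pinned object to kernfs. The node takes no reference;
 * it only registers for removal when the bpffs inode is destroyed. */
int model_expose_to_kernfs(struct model_inode *bpffs_inode,
			   struct model_kernfs_node *kn)
{
	if (model_watch_inode(bpffs_inode, model_remove_kernfs_node, kn) != 0)
		return -1;
	kn->present = true;
	return 0;
}

/* Unpinning destroys the inode and fires the registered callback. */
void model_unpin(struct model_inode *inode)
{
	if (!inode->alive)
		return;
	inode->alive = false;
	if (inode->on_destroy)
		inode->on_destroy(inode, inode->cb_data);
}
```

In this model, exposing an object that is not pinned fails, and calling model_unpin() clears the kernfs entry without the entry ever having owned the object. A "true pinning" variant, as described for sockfs, would additionally take a reference when the entry is created and drop it in the callback.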
File operations other than seq_show can also be implemented using bpf. For example, bpf may be of help for .poll and .mmap in kernfs.
This looks awesome! One thing I don't understand is: why did you go through the pinning interface vs. regular attach/detach? IOW, why not allow regular sys_bpf(BPF_PROG_ATTACH, prog_id, cgroup_id) and attach to the cgroup (which, in turn, creates the kernfs nodes)? Seems like this way you can drop the requirement on the object being pinned in the bpffs first?
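For reference, a minimal sketch of what the existing cgroup attach path looks like from userspace, using the raw bpf(2) syscall. Note that BPF_PROG_ATTACH takes file descriptors (target_fd, attach_bpf_fd) rather than ids, so the "prog_id, cgroup_id" shorthand above maps to a program fd and an fd for the cgroup directory. The helper name attach_prog_to_cgroup() is made up for illustration, and no attach type exists for bpf_view today, so the caller would have to pass whatever type such a change defined.

```c
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: attach a loaded program to a cgroup.
 * prog_fd   - fd of a loaded bpf program
 * cgroup_fd - fd obtained by open()ing the cgroup directory
 * type      - attach type; a bpf_view attach type is an assumption
 *             and does not exist in current kernels
 * Returns 0 on success, -1 with errno set on failure. */
int attach_prog_to_cgroup(int prog_fd, int cgroup_fd,
			  enum bpf_attach_type type)
{
	union bpf_attr attr;

	/* Unused fields of bpf_attr must be zero. */
	memset(&attr, 0, sizeof(attr));
	attr.target_fd = cgroup_fd;
	attr.attach_bpf_fd = prog_fd;
	attr.attach_type = type;

	return (int)syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
}
```

With an existing attach type as a stand-in, a call would look like attach_prog_to_cgroup(prog_fd, cgroup_fd, BPF_CGROUP_INET_INGRESS). With this path the attachment is owned by the cgroup rather than by a bpffs pin, which is the difference the question is getting at.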