On Fri, Jan 7, 2022 at 11:25 AM <sdf@xxxxxxxxxx> wrote:
>
> On 01/07, Hao Luo wrote:
> > On Thu, Jan 6, 2022 at 3:03 PM <sdf@xxxxxxxxxx> wrote:
> > >
> > > On 01/06, Hao Luo wrote:
> > > > Bpffs is a pseudo file system that persists bpf objects. Previously,
> > > > bpf objects could only be pinned in bpffs; this patchset extends
> > > > pinning to allow bpf objects to be pinned (or exposed) to other
> > > > file systems.
> > > >
> > > > In particular, this patchset allows pinning bpf objects in kernfs.
> > > > This creates a new file entry in the kernfs file system, and the
> > > > created file is able to reference the bpf object. By doing so, bpf
> > > > can be used to customize the file's operations, such as seq_show.
> > > >
> > > > As a concrete use case of this feature, this patchset introduces a
> > > > simple new program type called 'bpf_view', which can be used to
> > > > format a seq file from a kernel object's state. By pinning a
> > > > bpf_view program into a cgroup directory, userspace is able to
> > > > read the cgroup's state from the file in a format defined by the
> > > > bpf program.
> > > >
> > > > Different from bpffs, kernfs doesn't have a callback when a kernfs
> > > > node is freed, which is a problem if we allow the kernfs node to
> > > > hold an extra reference on the bpf object, because there is no
> > > > chance to decrement the object's refcnt. Therefore the kernfs node
> > > > created by pinning doesn't hold a reference on the bpf object; the
> > > > lifetime of the kernfs node depends on the lifetime of the bpf
> > > > object. Rather than "pinning in kernfs", it is "exposing to
> > > > kernfs". We require the bpf object to be pinned in bpffs first
> > > > before it can be pinned in kernfs. When the object is unpinned
> > > > from bpffs, its kernfs nodes will be removed automatically. This
> > > > effectively treats a pinned bpf object as a persistent "device".
> > > >
> > > > We rely on fsnotify to monitor the inode events in bpffs. A new
> > > > function bpf_watch_inode() is introduced. It allows registering a
> > > > callback function to run at inode destruction. For the kernfs
> > > > case, a callback that removes the kernfs node is registered at the
> > > > destruction of bpffs inodes. For other file systems such as
> > > > sockfs, bpf_watch_inode() can monitor the destruction of sockfs
> > > > inodes, and the created file entry can hold the bpf object's
> > > > reference. In this case, it is truly "pinning".
> > > >
> > > > File operations other than seq_show can also be implemented using
> > > > bpf. For example, bpf may be of help for .poll and .mmap in
> > > > kernfs.
> > >
> > > This looks awesome!
> > >
> > > One thing I don't understand is: why did you go through the pinning
> > > interface vs regular attach/detach? IOW, why not allow a regular
> > > sys_bpf(BPF_PROG_ATTACH, prog_id, cgroup_id) and attach to the
> > > cgroup (which, in turn, creates the kernfs nodes)? Seems like this
> > > way you can drop the requirement that the object be pinned in bpffs
> > > first?
> >
> > Thanks Stan.
> >
> > Yeah, the attach/detach approach is definitely another option. IIUC,
> > in comparison to pinning, does attach/detach only work for cgroups?
>
> attach has a target_fd argument that, in theory, can be whatever. We
> can add support for different fd types.

I see. With the attach API, are we also able to specify some attributes
for the attachment? For example, a property that we may want is: let
descendant cgroups inherit their parent cgroup's programs.
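For comparison, the existing cgroup-bpf attach path already encodes one
such property in its flags (BPF_F_ALLOW_OVERRIDE / BPF_F_ALLOW_MULTI):
with BPF_F_ALLOW_MULTI, a program attached to a parent cgroup stays
effective in its descendants. A minimal userspace sketch against
today's libbpf, with an arbitrary existing attach type standing in:

  #include <errno.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <bpf/bpf.h>

  /* prog_fd is the fd of an already loaded bpf program. */
  static int attach_with_inherit(int prog_fd, const char *cgroup_path)
  {
          int cg_fd, err;

          cg_fd = open(cgroup_path, O_RDONLY);
          if (cg_fd < 0)
                  return -errno;

          /* BPF_F_ALLOW_MULTI: the program stays effective in
           * descendant cgroups, and descendants may still attach
           * their own programs.
           */
          err = bpf_prog_attach(prog_fd, cg_fd,
                                BPF_CGROUP_INET_INGRESS,
                                BPF_F_ALLOW_MULTI);
          close(cg_fd);
          return err;
  }

Maybe a similar flag on the new attach type could express the
inheritance above.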
> > Pinning may be used on other file systems: sockfs, sysfs or
> > resctrl. But I don't know whether this generality is welcome, and
> > implementing seq_show is the only concrete use case I can think of
> > right now. If people think the ability to create files in other
> > subsystems is not good, I'd be happy to take a look at the
> > attach/detach approach, and that may be the right way.
>
> The reason I started thinking about attach/detach is because of the
> clunky unlink that you have to do (aka echo "rm" > file). IMO, having
> a standard attach/detach is much clearer. But I might be missing some
> complexity associated with non-cgroup filesystems.

Oh, I see. Looks good. Let me play with it before sending the next
version.
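Roughly the flow I'd like to prototype, sketched with today's libbpf
calls. Note that BPF_CGROUP_VIEW below is a placeholder name I'm making
up for this sketch, not an existing attach type:

  #include <errno.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <bpf/bpf.h>
  #include <linux/bpf.h>

  /* Placeholder: the real attach type doesn't exist yet. */
  #define BPF_CGROUP_VIEW ((enum bpf_attach_type)__MAX_BPF_ATTACH_TYPE)

  /* Attach creates the kernfs node under the cgroup directory;
   * detach would remove it. No bpffs pin, no echo "rm" > file.
   */
  static int expose_view(int prog_fd, const char *cgroup_path)
  {
          int cg_fd, err;

          cg_fd = open(cgroup_path, O_RDONLY);
          if (cg_fd < 0)
                  return -errno;

          err = bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_VIEW, 0);
          close(cg_fd);
          return err;
  }

Tearing down would then be bpf_prog_detach2(prog_fd, cg_fd,
BPF_CGROUP_VIEW) instead of the "rm" write.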