On Sat, Feb 5, 2022 at 8:29 PM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Fri, Feb 4, 2022 at 10:27 AM Hao Luo <haoluo@xxxxxxxxxx> wrote:
> >
> > >
> > > > In our use case, we can't ask the users who create cgroups to do the
> > > > pinning. Pinning requires root privilege. In our use case, we have
> > > > non-root users who can create cgroup directories and still want to
> > > > read bpf stats. They can't do pinning by themselves. This is why
> > > > inheritance is a requirement for us. With inheritance, they only need
> > > > to mkdir in cgroupfs and bpffs (unprivileged operations), no pinning
> > > > operation is required. Patch 1-4 are needed to implement inheritance.
> > > >
> > > > It's also not a good idea in our use case to add a userspace
> > > > privileged process to monitor cgroupfs operations and perform the
> > > > pinning. It's more complex and has a higher maintenance cost and
> > > > runtime overhead, compared to the solution of asking whoever makes
> > > > cgroups to mkdir in bpffs. The other problem is: if there are nodes in
> > > > the data center that don't have the userspace process deployed, the
> > > > stats will be unavailable, which is a no-no for some of our users.
> > >
> > > The commit log says that there will be a daemon that does that
> > > monitoring of cgroupfs. And that daemon needs to mkdir
> > > directories in bpffs when a new cgroup is created, no?
> > > The kernel is only doing inheritance of bpf progs into
> > > new dirs. I think that daemon can pin as well.
> > >
> > > The cgroup creation is typically managed by an agent like systemd.
> > > Sounds like you have your own agent that creates cgroups?
> > > If so it has to be privileged and it can mkdir in bpffs and pin too ?
> >
> > Ah, yes, we have our own daemon to manage cgroups. That daemon creates
> > the top-level cgroup for each job to run inside.
> > However, the job can create its own cgroups inside the top-level
> > cgroup, for fine grained resource control. This doesn't go through
> > the daemon. The job-created cgroups don't have the pinned objects
> > and this is a no-no for our users.
>
> We can whitelist certain tracepoints to be sleepable and extend
> tp_btf prog type to include everything from prog_type_syscall.
> Such prog would attach to cgroup_mkdir and cgroup_release
> and would call bpf_sys_bpf() helper to pin progs in new bpffs dirs.
> We can allow prog_type_syscall to do mkdir in bpffs as well.
>
> This feature could be useful for similar monitoring/introspection tasks.
> We can write a program that would monitor bpf prog load/unload
> and would pin an iterator prog that would show debug info about a prog.
> Like cat /sys/fs/bpf/progs.debug shows a list of loaded progs.
> With this feature we can implement:
> ls /sys/fs/bpf/all_progs.debug/
> and each loaded prog would have a corresponding file.
> The file name would be a program name, for example.
> cat /sys/fs/bpf/all_progs.debug/my_prog
> would pretty print info about 'my_prog' bpf program.
>
> This way the kernfs/cgroupfs specific logic from patches 1-4
> will not be necessary.
>
> wdyt?

Thanks Alexei. I gave it more thought in the last couple of days.
Actually I think it's a good idea, more flexible. It gets rid of the
need of a user space daemon for monitoring cgroup creation and
destruction. We could monitor task creations and exits as well, so
that we can export per-task information (e.g. task_vma_iter) more
efficiently.

A couple of thoughts when thinking about the details:

- Regarding parameterized pinning, I don't think we can have one
single bpf_iter_link object, but with different parameters. Because
parameters are part of the bpf_iter_link (bpf_iter_aux_info). So every
time we pin, we have to attach iter in order to get a new link object
first. So we need to add attach and detach in bpf_sys_bpf().
- We also need to add those syscalls for cleanup: (1) unlink for
removing pinned obj and (2) rmdir for removing the directory in
prog_type_syscall.

With these extensions, we can shift some of the bpf operations
currently performed in system daemons into the kernel. IMHO it's a
great thing, making system monitoring more flexible.