On Tue, Mar 31, 2020 at 07:42:46PM -0600, David Ahern wrote:
> On 3/30/20 7:17 PM, Alexei Starovoitov wrote:
> > On Mon, Mar 30, 2020 at 06:57:44PM -0600, David Ahern wrote:
> >> On 3/30/20 6:32 PM, Alexei Starovoitov wrote:
> >>>>
> >>>> This is not a large feature, and there is no reason for CREATE/UPDATE -
> >>>> a mere 4 patch set - to go in without something as essential as the
> >>>> QUERY for observability.
> >>>
> >>> As I said 'bpftool cgroup' covers it. Observability is not reduced in any way.
> >>
> >> You want a feature where a process can prevent another from installing a
> >> program on a cgroup. How do I learn which process is holding the
> >> bpf_link reference and preventing me from installing a program? Unless I
> >> have missed some recent change that is not currently covered by bpftool
> >> cgroup, and there is no way reading kernel code will tell me.
> >
> > No. That's not the case at all. You misunderstood the concept.
>
> I don't think so ...
>
> >
> >> That is my point. You are restricting what root can do and people will
> >> not want to resort to killing random processes trying to find the one
> >> holding a reference.
> >
> > Not true either.
> > bpf_link = old attach with allow_multi (but with extra safety for owner)
>
> cgroup programs existed for roughly 1 year before BPF_F_ALLOW_MULTI.
> That's a year for tools like 'ip vrf exec' to exist and be relied on.
> 'ip vrf exec' does not use MULTI.
>
> I have not done a deep dive on systemd code, but on ubuntu 18.04 system:
>
> $ sudo ~/bin/bpftool cgroup tree
> CgroupPath
> ID       AttachType      AttachFlags     Name
> /sys/fs/cgroup/unified/system.slice/systemd-udevd.service
>     5        ingress
>     4        egress
> /sys/fs/cgroup/unified/system.slice/systemd-journald.service
>     3        ingress
>     2        egress
> /sys/fs/cgroup/unified/system.slice/systemd-logind.service
>     7        ingress
>     6        egress
>
> suggests that multi is not common with systemd either at some point in
> its path, so 'ip vrf exec' is not alone in not using the flag. There
> most likely are many other tools.

Please take a look at the systemd source code:
src/core/bpf-devices.c
src/core/bpf-firewall.c
It prefers to use BPF_F_ALLOW_MULTI when possible, since it's the most
sensible flag.
Since 'ip vrf exec' is not using allow_multi, it's breaking several systemd
features (regardless of what bpf_link can and cannot do).

> > The only thing bpf_link protects is the owner of the link from other
> > processes of nuking that link.
> > It does _not_ prevent other processes attaching their own cgroup-bpf progs
> > either via old interface or via bpf_link.
> >
>
> It does when that older code does not use the MULTI flag. There is a
> history that is going to create conflicts and being able to id which
> program holds the bpf_link is essential.
>
> And this is really just one use case. There are many other reasons for
> wanting to know what process is holding a reference to something.

I'm not disagreeing that it's useful to query what is attached where.
My point, once again, is that bpf_link for cgroup didn't change a single bit
in this logic. There are processes (like systemd) that are using allow_multi.
When they switch to bpf_link a few years from now, nothing will change for
all other processes in the system. Only systemd will be assured that their
bpf-device prog will not be accidentally removed by 'ip vrf'.
Currently nothing protects systemd's bpf progs. Any cap_net_admin process
can _accidentally_ nuke them.
It's even more weird that the bpf-cgroup-device that systemd is using is
under cap_net_admin. There is nothing networking about it.
But that's a separate discussion.

Maybe you should fix 'ip vrf' first, before the systemd folks start yelling,
and then we can continue arguing about the merits of observability?