On Wed, Mar 25, 2020 at 10:13 PM Jakub Kicinski <kuba@xxxxxxxxxx> wrote:
>
> On Wed, 25 Mar 2020 17:16:13 -0700 Andrii Nakryiko wrote:
> > > >> Well, I wasn't talking about any of those subsystems, I was talking
> > > >> about networking :)
> > > >
> > > > So it's not "BPF subsystem's relation to the rest of the kernel" from
> > > > your previous email, it's now only "talking about networking"? Since
> > > > when the rest of the kernel is networking?
> > >
> > > Not really, I would likely argue the same for any other subsystem, I
> >
> > And you would like lose that argument :) You already agreed that for
> > tracing this is not the case. BPF is not attached by writing text into
> > ftrace's debugfs entries. Same for cgroups, we don't
> > create/update/write special files in cgroupfs, we have an explicit
> > attachment API in BPF.
> >
> > BTW, kprobes started out with the same model as XDP has right now. You
> > had to do a bunch of magic writes into various debugfs files to attach
> > BPF program. If user-space application crashed, kprobe stayed
> > attached. This was horrible and led to many problems in real world
> > production uses. So a completely different interface was created,
> > allowing to do it through perf_event_open() and created anonymous
> > inode for BPF program attachment. That allowed crashing program to
> > auto-detach kprobe and not harm production use case.
> >
> > Now we are coming after cgroup BPF programs, which have similar issues
> > and similar pains in production. cgroup BPF progs actually have extra
> > problems: programs can user-space applications can accidentally
> > replace a critical cgroup program and ruin the day for many folks that
> > have to deal with production breakage after that. Which is why I'm
> > implementing bpf_link with all its properties: to solve real pain and
> > real problem.
> >
> > Now for XDP. It has same flawed model.
> > And even if it seems to you
> > that it's not a big issue, and even if Jakub thinks we are trying to
> > solve non-existing problem, it is a real problem and a real concern
> > from people that have to support XDP in production with many
>
> More than happy to talk to those folks, and see the tickets.

We can certainly set up a meeting with Andrey and Takshak.

> Toke has actual user space code which needs his extension, and for
> which "ownership" makes no difference as it would just be passed with
> whoever touched the program last.

As has been repeated time and time again, we cannot allow any random
application to just go and replace an XDP program. Same for cgroups.
It's not a hypothetical problem: it has happened and it has caused
problems. So just because Toke's prototype doesn't have any protection
against this doesn't mean that's how it will end up being.

> > well-meaning developers developing BPF applications independently.
>
> There is one single program which can be attached to the XDP hook,
> the "everybody attaches their program model" does not apply.

Yes, but you've followed all the XDP chaining discussion and freplace
stuff up until now, right? There is going to be a single XDP root
program, but other applications are going to plug their freplace
programs into it. And Tupperware wants to control the XDP root program
and not let anyone replace it, even though some programs will need to
have root access anyway.

> TW agent should just listen on netlink notifications to see if someone

I'll leave it up to the TW agent team to decide if that's a good idea.
But please educate me. When some app replaces the XDP program
accidentally, how can the TW agent make sure (by following netlink
notifications) that **no** packet is intercepted and mis-routed by this
wrong XDP program? Do netlink notifications give any guarantee that a
listening application will be able to undo the operation in between two
network packets?

> replaced its program.
> cgroups have multi-attachment and no notifications

Multi-attachment is not always appropriate, which is why Andrey Ignatov
asked to support all modes (NONE, OVERRIDABLE, MULTI). But honestly,
I've lost track of why this is relevant here.

> (although not sure anyone was explicitly asking for links there,
> either).

Tupperware did.

> In production a no-op XDP program is likely to be attached from the
> moment machine boots, to avoid traffic interruption and the risk of
> something going wrong with the driver when switching between skb to
> xdp datapath. And then the program is only replaced, not detached.

Good, so there is no problem pinning it somewhere forever.

> Not to mention the fact that networking applications generally don't
> want to remove their policy from the kernel when they crash :/

Yes, which is why bpf_links are trivially pinnable. bpf_link gives a
choice. What's there right now in XDP (program FD attachment) doesn't
give a choice of auto-detaching on application crash for cases where
it's appropriate (some relatively short-running XDP monitoring script,
for example).

> > Now, those were fundamental things, but I'd like to touch on a "nice
> > things we get with that". Having a proper kernel object representing
> > single instance of attached BPF program to some other kernel object
> > allows to build an uniform and consistent API around bpf_link with
> > same semantics. We can do LINK_UPDATE and allow to atomically replace
> > BPF program inside the established bpf_link. It's applicable to all
> > types of BPF program attachment and can be done in a way that ensures
> > no BPF program invocation is skipped while BPF programs are swapped
> > (because at the lowest level it boils down to an atomic pointer swap).
> > Of course not all bpf_links might have this support initially, but
> > we'll establish a lot of common infrastructure which will make it
> > simpler, faster and more reliable to add this functionality.
>
> XDP replace is already atomic, no packet will be passed without either
> old or new program executed on it.

Please re-read what I wrote, the entire thing. You are picking
arbitrary pieces and considering them in isolation. It's either
dishonest or you are missing the point.

> > And to wrap up. I agree, consistent API is not a goal in itself, as
> > Jakub mentioned. But it is a worthy goal nevertheless, especially if
> > it doesn't cost anything extra. It makes kernel developers lives
>
> Not sure how having two interfaces instead of one makes kernel
> developer's life easier.

There is no interface for bpf_link for XDP right now. But let's
separate the netlink vs bpf syscall discussion from the general
bpf_link discussion.
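
To make the "atomic pointer swap" remark and the "don't let a random
application replace the program" concern a bit more concrete, here is a
minimal user-space sketch in C11. It is not kernel or libbpf code:
struct prog, struct hook_link, link_run(), link_update() and
link_update_if() are all hypothetical names made up for illustration,
and the guarded variant is only an assumption about what an "update
only if the expected program is still attached" rule could look like.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a BPF program: a name plus a verdict function
 * that returns 1 to pass a packet and 0 to drop it. */
struct prog {
    const char *name;
    int (*run)(int pkt);
};

/* Hypothetical stand-in for a link: the single slot the hook reads from. */
struct hook_link {
    _Atomic(struct prog *) prog;
};

static int pass_all(int pkt) { (void)pkt; return 1; }
static int drop_odd(int pkt) { return pkt % 2 == 0; }

static struct prog prog_pass = { "pass_all", pass_all };
static struct prog prog_drop_odd = { "drop_odd", drop_odd };

/* Hook side: one atomic load per packet, so every packet is handled by
 * exactly one program, either the old one or the new one. */
static int link_run(struct hook_link *l, int pkt)
{
    struct prog *p = atomic_load(&l->prog);
    return p->run(pkt);
}

/* Unconditional replace: swaps in new_prog and returns the old program. */
static struct prog *link_update(struct hook_link *l, struct prog *new_prog)
{
    assert(new_prog != NULL); /* in this model a link is never left empty */
    return atomic_exchange(&l->prog, new_prog);
}

/* Guarded replace: succeeds (returns 1) only if `expected` is the program
 * currently attached; otherwise leaves the link untouched and returns 0. */
static int link_update_if(struct hook_link *l, struct prog *expected,
                          struct prog *new_prog)
{
    assert(new_prog != NULL);
    return atomic_compare_exchange_strong(&l->prog, &expected, new_prog);
}
```

In this model, a caller that initialises the link with prog_pass can
only install prog_drop_odd through link_update_if() if it names
prog_pass as the program it expects to replace. A caller that guesses
wrong gets 0 back, and packets keep going through the old program.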