On Thu, Mar 26, 2020 at 5:35 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote: > > Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes: > > > Now for XDP. It has same flawed model. And even if it seems to you > > that it's not a big issue, and even if Jakub thinks we are trying to > > solve non-existing problem, it is a real problem and a real concern > > from people that have to support XDP in production with many > > well-meaning developers developing BPF applications independently. > > Copying what you wrote in another thread: > > > >> Setting aside the question of which is the best abstraction to represent > >> an attachment, it seems to me that the actual behavioural problem (XDP > >> programs being overridden by mistake) would be solvable by this patch, > >> assuming well-behaved userspace applications. > > > > ... this is a horrible and unrealistic assumption that we just cannot > > make and accept. However well-behaved userspace applications are, they > > are written by people that make mistakes. And rather than blissfully > > expect that everything will be fine, we want to have enforcements in > > place that will prevent some buggy application to wreck havoc in > > production. > > Look, I'm not trying to tell you how to managed your internal systems. > I'm just objecting to your assertion that your deployment model is the > only one that can possibly work, and the refusal to consider other > alternatives that comes with it. Your assumption doesn't work for us. Because of that we need something like bpf_link. Existing attachment API doesn't go away and is still supported. Feel free to use existing API. As for EXPECTED_FD API you are adding, it will be up to maintainers to decide, ultimately, I can't block it, even if I wanted to. > > >> You're saying that like we didn't already have the netlink API. We > >> essentially already have (the equivalent of) LINK_CREATE and LINK_QUERY, > >> this is just adding LINK_UPDATE. It's a straight-forward fix of an > >> existing API; essentially you're saying we should keep the old API in a > >> crippled state in order to promote your (proposed) new API. > > > > This is the fundamental disagreement that we seem to have. XDP's BPF > > program attachment is not in any way equivalent to bpf_link. So no, > > netlink API currently doesn't have anything that's close to bpf_link. > > Let me try to summarize what bpf_link is and what are its fundamental > > properties regardless of type of BPF programs. > > First of all, thank you for this summary; that is very useful! Sure, you're welcome. > > > 1. bpf_link represents a connection (pairing?) of BPF program and some > > BPF hook it is attached to. BPF hook could be perf event, cgroup, > > netdev, etc. It's a completely independent object in itself, along the > > bpf_map and bpf_prog, which has its own lifetime and kernel > > representation. To user-space application it is returned as an > > installed FD, similar to loaded BPF program and BPF map. It is > > important that it's not just a BPF program, because BPF program can be > > attached to multiple BPF hooks (e.g., same XDP program can be attached > > to multiple interface; same kprobe handler can be installed multiple > > times), which means that having BPF program FD isn't enough to > > uniquely represent that one specific BPF program attachment and detach > > it or query it. Having kernel object for this allows to encapsulate > > all these various details of what is attached were and present to > > user-space a single handle (FD) to work with. > > For XDP there is already a unique handle, it's just implicit: Each > netdev can have exactly one XDP program loaded. So I don't really see > how bpf_link adds anything, other than another API for the same thing? I certainly failed to explain things clearly if you are still asking this. See point #2, once you attach bpf_link you can't just replace it. This is what XDP doesn't have right now. It's a game of picking features/properties in isolation and "we can do this particular thing this different way with what we have". Please, try consider all of it together, it's important. Every single aspect of bpf_link is not that unique, but it's all of them together that matter. > > > 2. Due to having FD associated with bpf_link, it's not possible to > > talk about "owning" bpf_link. If application created link and never > > shared its FD with any other application, it is the sole owner of it. > > But it also means that you can share it, if you need it. Now, once > > application closes FD or app crashes and kernel automatically closes > > that FD, bpf_link refcount is decremented. If it was the last or only > > FD, it will trigger automatica detachment and clean up of that > > particular BPF program attachment. Note, not a clean up of BPF > > program, which can still be attached somewhere else: only that > > particular attachment. > > This behaviour is actually one of my reservations against bpf_link for > XDP: I think that automatically detaching XDP programs when the FD is > closed is very much the wrong behaviour. An XDP program processes > packets, and when loading one I very much expect it to keep doing that > until I explicitly tell it to stop. As you mentioned earlier, "it's not the only one mode". Just like with tracing APIs, you can imagine scripts that would adds their packet-sniffing XDP program temporarily. If they crash, "temporarily" turns into "permanently, but no one knows". This is bad. And again, it's a choice, just with a default to auto-cleanup, because it's safe, even if it requires extra step for applications willing to do permanent XDP attachment. > > > 3. This derives from the concept of ownership of bpf_link. Once > > bpf_link is attached, no other application that doesn't own that > > bpf_link can replace, detach or modify the link. For some cases it > > doesn't matter. E.g., for tracing, all attachment to the same fentry > > trampoline are completely independent. But for other cases this is > > crucial property. E.g., when you attach BPF program in an exclusive > > (single) mode, it means that particular cgroup and any of its children > > cgroups can have any more BPF programs attached. This is important for > > container management systems to enforce invariants and correct > > functioning of the system. Right now it's very easy to violate that - > > you just go and attach your own BPF program, and previous BPF program > > gets automatically detached without original application that put it > > there knowing about this. Chaos ensues after that and real people have > > to deal with this. Which is why existing > > BPF_PROG_ATTACH/BPF_PROG_DETACH API is inadequate and we are adding > > bpf_link support. > > I can totally see how having an option to enforce a policy such as > locking out others from installing cgroup BPF programs is useful. But > such an option is just that: policy. So building this policy in as a > fundamental property of the API seems like a bad idea; that is > effectively enforcing policy in the kernel, isn't it? I hope we won't go into a dictionary definition of what "policy" means here :). For me it's about guarantee that kernel gives to user-space. bpf_link doesn't care about dictating policies. If you don't want this guarantee - don't use bpf_link, use direct program attachment. As simple as that. Policy is implemented by user-space application by using APIs with just the right guarantees. > > > Those same folks have similar concern with XDP. In the world where > > container management installs "root" XDP program which other user > > applications can plug into (libxdp use case, right?), it's crucial to > > ensure that this root XDP program is not accidentally overwritten by > > some well-meaning, but not overly cautious developer experimenting in > > his own container with XDP programs. This is where bpf_link ownership > > plays a huge role. Tupperware agent (FB's container management agent) > > would install root XDP program and will hold onto this bpf_link > > without sharing it with other applications. That will guarantee that > > the system will be stable and can't be compromised. > > See this is where we get into "deployment-model specific territory". I > mean, sure, in the "central management daemon" model, it makes sense > that no other applications can replace the XDP program. But, erm, we > already have a mechanism to ensure that: Just don't grant those > applications CAP_NET_ADMIN? So again, bpf_link doesn't really seem to > add anything other than a different way to do the same thing? Because there are still applications that need CAP_NET_ADMIN in order to function (for other reasons than attaching XDP), so it's impossible to enforce with for everyone. > > Additionally, in the case where there is *not* a central management > daemon (i.e., what I'm implementing with libxdp), this would be the flow > implemented by the library without bpf_link: > > 1. Query kernel for current BPF prog loaded on $IFACE > 2. Sanity-check that this program is a dispatcher program installed by > libxdp > 3. Create a new dispatcher program with whatever changes we want to do > (such as adding another component program). > 4. Atomically replace the old program with the new one using the netlink > API in this patch series. > > Whereas with bpf_link, it would be: > > 1. Find the pinned bpf_link for $IFACE (e.g., load from > /sys/fs/bpf/iface-links/$IFNAME). But now you can hide this mount point from containerized root/CAP_NET_ADMIN application, can't you? See the difference? One might think about bpf_link as a fine-grained capability in this sense. > 2. Query kernel for current BPF prog linked to $LINK > 3. Sanity-check that this program is a dispatcher program installed by > libxdp > 4. Create a new dispatcher program with whatever changes we want to do > (such as adding another component program). > 5. Atomically replace the old program with the new one using the > LINK_UPDATE bpf() API. > > > So all this does is add an additional step, and another dependency on > bpffs. And crucially, I really don't see how the "bpf_link is the only > thing that is not fundamentally broken" argument holds up. > > -Toke >