On Fri, Mar 27, 2020 at 4:07 AM Lorenz Bauer <lmb@xxxxxxxxxxxxxx> wrote: > > On Thu, 26 Mar 2020 at 19:06, Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote: > > > > On Thu, Mar 26, 2020 at 5:35 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote: > > > > > > Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes: > > > > > > > Now for XDP. It has same flawed model. And even if it seems to you > > > > that it's not a big issue, and even if Jakub thinks we are trying to > > > > solve non-existing problem, it is a real problem and a real concern > > > > from people that have to support XDP in production with many > > > > well-meaning developers developing BPF applications independently. > > > > Copying what you wrote in another thread: > > > > > > > >> Setting aside the question of which is the best abstraction to represent > > > >> an attachment, it seems to me that the actual behavioural problem (XDP > > > >> programs being overridden by mistake) would be solvable by this patch, > > > >> assuming well-behaved userspace applications. > > > > > > > > ... this is a horrible and unrealistic assumption that we just cannot > > > > make and accept. However well-behaved userspace applications are, they > > > > are written by people that make mistakes. And rather than blissfully > > > > expect that everything will be fine, we want to have enforcements in > > > > place that will prevent some buggy application to wreck havoc in > > > > production. > > > > > > Look, I'm not trying to tell you how to managed your internal systems. > > > I'm just objecting to your assertion that your deployment model is the > > > only one that can possibly work, and the refusal to consider other > > > alternatives that comes with it. > > > > Your assumption doesn't work for us. Because of that we need something > > like bpf_link. Existing attachment API doesn't go away and is still > > supported. Feel free to use existing API. As for EXPECTED_FD API you > > are adding, it will be up to maintainers to decide, ultimately, I > > can't block it, even if I wanted to. > > > > > > > > >> You're saying that like we didn't already have the netlink API. We > > > >> essentially already have (the equivalent of) LINK_CREATE and LINK_QUERY, > > > >> this is just adding LINK_UPDATE. It's a straight-forward fix of an > > > >> existing API; essentially you're saying we should keep the old API in a > > > >> crippled state in order to promote your (proposed) new API. > > > > > > > > This is the fundamental disagreement that we seem to have. XDP's BPF > > > > program attachment is not in any way equivalent to bpf_link. So no, > > > > netlink API currently doesn't have anything that's close to bpf_link. > > > > Let me try to summarize what bpf_link is and what are its fundamental > > > > properties regardless of type of BPF programs. > > > > > > First of all, thank you for this summary; that is very useful! > > > > Sure, you're welcome. > > > > > > > > > 1. bpf_link represents a connection (pairing?) of BPF program and some > > > > BPF hook it is attached to. BPF hook could be perf event, cgroup, > > > > netdev, etc. It's a completely independent object in itself, along the > > > > bpf_map and bpf_prog, which has its own lifetime and kernel > > > > representation. To user-space application it is returned as an > > > > installed FD, similar to loaded BPF program and BPF map. It is > > > > important that it's not just a BPF program, because BPF program can be > > > > attached to multiple BPF hooks (e.g., same XDP program can be attached > > > > to multiple interface; same kprobe handler can be installed multiple > > > > times), which means that having BPF program FD isn't enough to > > > > uniquely represent that one specific BPF program attachment and detach > > > > it or query it. Having kernel object for this allows to encapsulate > > > > all these various details of what is attached were and present to > > > > user-space a single handle (FD) to work with. > > > > > > For XDP there is already a unique handle, it's just implicit: Each > > > netdev can have exactly one XDP program loaded. So I don't really see > > > how bpf_link adds anything, other than another API for the same thing? > > > > I certainly failed to explain things clearly if you are still asking > > this. See point #2, once you attach bpf_link you can't just replace > > it. This is what XDP doesn't have right now. > > From your description I like bpf_link, because it'll make attachment easier > to support, and the pinning behaviour also seems nice. I'm really not fussed > by netlink vs syscall, whatever. Great, thanks. > > However, this behaviour concerns me. It's like Windows not > letting you delete a file while an application has it opened, which just leads > to randomly killing programs until you find the right one. It's frustrating > and counter productive. > > You're taking power away from the operator. In your deployment scenario > this might make sense, but I think it's a really bad model in general. If I am > privileged I need to be able to exercise that privilege. This means that if > there is a netdevice in my network namespace, and I have CAP_NET_ADMIN > or whatever, I can break the association. > > So, to be constructive: I'd prefer bpf_link to replace a netlink attachment and > vice versa. If you need to restrict control, use network namespaces > to hide the devices, instead of hiding the bpffs. Alexei mentioned a "nuke" option few times already, that will solve this. The idea is that human operation should be able to do this, but not applications, even though they have CAP_NET_ADMIN. I don't know how exactly interface will look like, but it shouldn't allow applications just randomly replace bpf_link. There are legitimate use cases where application has to have CAP_NET_ADMIN and we can't hide netdevice from them, unfortunately. > > > > > It's a game of picking features/properties in isolation and "we can do > > this particular thing this different way with what we have". Please, > > try consider all of it together, it's important. Every single aspect > > of bpf_link is not that unique, but it's all of them together that > > matter. > > > > > > > > > 2. Due to having FD associated with bpf_link, it's not possible to > > > > talk about "owning" bpf_link. If application created link and never > > > > shared its FD with any other application, it is the sole owner of it. > > > > But it also means that you can share it, if you need it. Now, once > > > > application closes FD or app crashes and kernel automatically closes > > > > that FD, bpf_link refcount is decremented. If it was the last or only > > > > FD, it will trigger automatica detachment and clean up of that > > > > particular BPF program attachment. Note, not a clean up of BPF > > > > program, which can still be attached somewhere else: only that > > > > particular attachment. > > > > > > This behaviour is actually one of my reservations against bpf_link for > > > XDP: I think that automatically detaching XDP programs when the FD is > > > closed is very much the wrong behaviour. An XDP program processes > > > packets, and when loading one I very much expect it to keep doing that > > > until I explicitly tell it to stop. > > > > As you mentioned earlier, "it's not the only one mode". Just like with > > tracing APIs, you can imagine scripts that would adds their > > packet-sniffing XDP program temporarily. If they crash, "temporarily" > > turns into "permanently, but no one knows". This is bad. And again, > > it's a choice, just with a default to auto-cleanup, because it's safe, > > even if it requires extra step for applications willing to do > > permanent XDP attachment. > > > > > > > > > 3. This derives from the concept of ownership of bpf_link. Once > > > > bpf_link is attached, no other application that doesn't own that > > > > bpf_link can replace, detach or modify the link. For some cases it > > > > doesn't matter. E.g., for tracing, all attachment to the same fentry > > > > trampoline are completely independent. But for other cases this is > > > > crucial property. E.g., when you attach BPF program in an exclusive > > > > (single) mode, it means that particular cgroup and any of its children > > > > cgroups can have any more BPF programs attached. This is important for > > > > container management systems to enforce invariants and correct > > > > functioning of the system. Right now it's very easy to violate that - > > > > you just go and attach your own BPF program, and previous BPF program > > > > gets automatically detached without original application that put it > > > > there knowing about this. Chaos ensues after that and real people have > > > > to deal with this. Which is why existing > > > > BPF_PROG_ATTACH/BPF_PROG_DETACH API is inadequate and we are adding > > > > bpf_link support. > > > > > > I can totally see how having an option to enforce a policy such as > > > locking out others from installing cgroup BPF programs is useful. But > > > such an option is just that: policy. So building this policy in as a > > > fundamental property of the API seems like a bad idea; that is > > > effectively enforcing policy in the kernel, isn't it? > > > > I hope we won't go into a dictionary definition of what "policy" means > > here :). For me it's about guarantee that kernel gives to user-space. > > bpf_link doesn't care about dictating policies. If you don't want this > > guarantee - don't use bpf_link, use direct program attachment. As > > simple as that. Policy is implemented by user-space application by > > using APIs with just the right guarantees. > > > > > > > > > Those same folks have similar concern with XDP. In the world where > > > > container management installs "root" XDP program which other user > > > > applications can plug into (libxdp use case, right?), it's crucial to > > > > ensure that this root XDP program is not accidentally overwritten by > > > > some well-meaning, but not overly cautious developer experimenting in > > > > his own container with XDP programs. This is where bpf_link ownership > > > > plays a huge role. Tupperware agent (FB's container management agent) > > > > would install root XDP program and will hold onto this bpf_link > > > > without sharing it with other applications. That will guarantee that > > > > the system will be stable and can't be compromised. > > > > > > See this is where we get into "deployment-model specific territory". I > > > mean, sure, in the "central management daemon" model, it makes sense > > > that no other applications can replace the XDP program. But, erm, we > > > already have a mechanism to ensure that: Just don't grant those > > > applications CAP_NET_ADMIN? So again, bpf_link doesn't really seem to > > > add anything other than a different way to do the same thing? > > > > Because there are still applications that need CAP_NET_ADMIN in order > > to function (for other reasons than attaching XDP), so it's impossible > > to enforce with for everyone. > > I think I'm missing some context. CAP_NET_ADMIN is trusted by definition, > so trust these applications to not fiddle with XDP? Are there many of these? > Are they inside a user namespace or something? > > > > > > > > > Additionally, in the case where there is *not* a central management > > > daemon (i.e., what I'm implementing with libxdp), this would be the flow > > > implemented by the library without bpf_link: > > > > > > 1. Query kernel for current BPF prog loaded on $IFACE > > > 2. Sanity-check that this program is a dispatcher program installed by > > > libxdp > > > 3. Create a new dispatcher program with whatever changes we want to do > > > (such as adding another component program). > > > 4. Atomically replace the old program with the new one using the netlink > > > API in this patch series. > > > > > > Whereas with bpf_link, it would be: > > > > > > 1. Find the pinned bpf_link for $IFACE (e.g., load from > > > /sys/fs/bpf/iface-links/$IFNAME). > > > > But now you can hide this mount point from containerized > > root/CAP_NET_ADMIN application, can't you? See the difference? One > > might think about bpf_link as a fine-grained capability in this sense. > > > > > > > 2. Query kernel for current BPF prog linked to $LINK > > > 3. Sanity-check that this program is a dispatcher program installed by > > > libxdp > > > 4. Create a new dispatcher program with whatever changes we want to do > > > (such as adding another component program). > > > 5. Atomically replace the old program with the new one using the > > > LINK_UPDATE bpf() API. > > > > > > > > > So all this does is add an additional step, and another dependency on > > > bpffs. And crucially, I really don't see how the "bpf_link is the only > > > thing that is not fundamentally broken" argument holds up. > > > > > > -Toke > > > > > > > -- > Lorenz Bauer | Systems Engineer > 6th Floor, County Hall/The Riverside Building, SE1 7PB, UK > > www.cloudflare.com