Re: [PATCH bpf-next 1/4] xdp: Support specifying expected existing program when attaching XDP

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Fri, 27 Mar 2020 12:42:20 -0700

On Fri, Mar 27, 2020 at 4:07 AM Lorenz Bauer <lmb@xxxxxxxxxxxxxx> wrote:
>
> On Thu, 26 Mar 2020 at 19:06, Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
> >
> > On Thu, Mar 26, 2020 at 5:35 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> > >
> > > Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes:
> > >
> > > > Now for XDP. It has same flawed model. And even if it seems to you
> > > > that it's not a big issue, and even if Jakub thinks we are trying to
> > > > solve non-existing problem, it is a real problem and a real concern
> > > > from people that have to support XDP in production with many
> > > > well-meaning developers developing BPF applications independently.
> > > > Copying what you wrote in another thread:
> > > >
> > > >> Setting aside the question of which is the best abstraction to represent
> > > >> an attachment, it seems to me that the actual behavioural problem (XDP
> > > >> programs being overridden by mistake) would be solvable by this patch,
> > > >> assuming well-behaved userspace applications.
> > > >
> > > > ... this is a horrible and unrealistic assumption that we just cannot
> > > > make and accept. However well-behaved userspace applications are, they
> > > > are written by people that make mistakes. And rather than blissfully
> > > > expect that everything will be fine, we want to have enforcements in
> > > > place that will prevent some buggy application to wreck havoc in
> > > > production.
> > >
> > > Look, I'm not trying to tell you how to managed your internal systems.
> > > I'm just objecting to your assertion that your deployment model is the
> > > only one that can possibly work, and the refusal to consider other
> > > alternatives that comes with it.
> >
> > Your assumption doesn't work for us. Because of that we need something
> > like bpf_link. Existing attachment API doesn't go away and is still
> > supported. Feel free to use existing API. As for EXPECTED_FD API you
> > are adding, it will be up to maintainers to decide, ultimately, I
> > can't block it, even if I wanted to.
> >
> > >
> > > >> You're saying that like we didn't already have the netlink API. We
> > > >> essentially already have (the equivalent of) LINK_CREATE and LINK_QUERY,
> > > >> this is just adding LINK_UPDATE. It's a straight-forward fix of an
> > > >> existing API; essentially you're saying we should keep the old API in a
> > > >> crippled state in order to promote your (proposed) new API.
> > > >
> > > > This is the fundamental disagreement that we seem to have. XDP's BPF
> > > > program attachment is not in any way equivalent to bpf_link. So no,
> > > > netlink API currently doesn't have anything that's close to bpf_link.
> > > > Let me try to summarize what bpf_link is and what are its fundamental
> > > > properties regardless of type of BPF programs.
> > >
> > > First of all, thank you for this summary; that is very useful!
> >
> > Sure, you're welcome.
> >
> > >
> > > > 1. bpf_link represents a connection (pairing?) of BPF program and some
> > > > BPF hook it is attached to. BPF hook could be perf event, cgroup,
> > > > netdev, etc. It's a completely independent object in itself, along the
> > > > bpf_map and bpf_prog, which has its own lifetime and kernel
> > > > representation. To user-space application it is returned as an
> > > > installed FD, similar to loaded BPF program and BPF map. It is
> > > > important that it's not just a BPF program, because BPF program can be
> > > > attached to multiple BPF hooks (e.g., same XDP program can be attached
> > > > to multiple interface; same kprobe handler can be installed multiple
> > > > times), which means that having BPF program FD isn't enough to
> > > > uniquely represent that one specific BPF program attachment and detach
> > > > it or query it. Having kernel object for this allows to encapsulate
> > > > all these various details of what is attached were and present to
> > > > user-space a single handle (FD) to work with.
> > >
> > > For XDP there is already a unique handle, it's just implicit: Each
> > > netdev can have exactly one XDP program loaded. So I don't really see
> > > how bpf_link adds anything, other than another API for the same thing?
> >
> > I certainly failed to explain things clearly if you are still asking
> > this. See point #2, once you attach bpf_link you can't just replace
> > it. This is what XDP doesn't have right now.
>
> From your description I like bpf_link, because it'll make attachment easier
> to support, and the pinning behaviour also seems nice. I'm really not fussed
> by netlink vs syscall, whatever.

Great, thanks.

>
> However, this behaviour concerns me. It's like Windows not
> letting you delete a file while an application has it opened, which just leads
> to randomly killing programs until you find the right one. It's frustrating
> and counter productive.
>
> You're taking power away from the operator. In your deployment scenario
> this might make sense, but I think it's a really bad model in general. If I am
> privileged I need to be able to exercise that privilege. This means that if
> there is a netdevice in my network namespace, and I have CAP_NET_ADMIN
> or whatever, I can break the association.
>
> So, to be constructive: I'd prefer bpf_link to replace a netlink attachment and
> vice versa. If you need to restrict control, use network namespaces
> to hide the devices, instead of hiding the bpffs.

Alexei mentioned a "nuke" option few times already, that will solve
this. The idea is that human operation should be able to do this, but
not applications, even though they have CAP_NET_ADMIN. I don't know
how exactly interface will look like, but it shouldn't allow
applications just randomly replace bpf_link. There are legitimate use
cases where application has to have CAP_NET_ADMIN and we can't hide
netdevice from them, unfortunately.

>
> >
> > It's a game of picking features/properties in isolation and "we can do
> > this particular thing this different way with what we have". Please,
> > try consider all of it together, it's important. Every single aspect
> > of bpf_link is not that unique, but it's all of them together that
> > matter.
> >
> > >
> > > > 2. Due to having FD associated with bpf_link, it's not possible to
> > > > talk about "owning" bpf_link. If application created link and never
> > > > shared its FD with any other application, it is the sole owner of it.
> > > > But it also means that you can share it, if you need it. Now, once
> > > > application closes FD or app crashes and kernel automatically closes
> > > > that FD, bpf_link refcount is decremented. If it was the last or only
> > > > FD, it will trigger automatica detachment and clean up of that
> > > > particular BPF program attachment. Note, not a clean up of BPF
> > > > program, which can still be attached somewhere else: only that
> > > > particular attachment.
> > >
> > > This behaviour is actually one of my reservations against bpf_link for
> > > XDP: I think that automatically detaching XDP programs when the FD is
> > > closed is very much the wrong behaviour. An XDP program processes
> > > packets, and when loading one I very much expect it to keep doing that
> > > until I explicitly tell it to stop.
> >
> > As you mentioned earlier, "it's not the only one mode". Just like with
> > tracing APIs, you can imagine scripts that would adds their
> > packet-sniffing XDP program temporarily. If they crash, "temporarily"
> > turns into "permanently, but no one knows". This is bad. And again,
> > it's a choice, just with a default to auto-cleanup, because it's safe,
> > even if it requires extra step for applications willing to do
> > permanent XDP attachment.
> >
> > >
> > > > 3. This derives from the concept of ownership of bpf_link. Once
> > > > bpf_link is attached, no other application that doesn't own that
> > > > bpf_link can replace, detach or modify the link. For some cases it
> > > > doesn't matter. E.g., for tracing, all attachment to the same fentry
> > > > trampoline are completely independent. But for other cases this is
> > > > crucial property. E.g., when you attach BPF program in an exclusive
> > > > (single) mode, it means that particular cgroup and any of its children
> > > > cgroups can have any more BPF programs attached. This is important for
> > > > container management systems to enforce invariants and correct
> > > > functioning of the system. Right now it's very easy to violate that -
> > > > you just go and attach your own BPF program, and previous BPF program
> > > > gets automatically detached without original application that put it
> > > > there knowing about this. Chaos ensues after that and real people have
> > > > to deal with this. Which is why existing
> > > > BPF_PROG_ATTACH/BPF_PROG_DETACH API is inadequate and we are adding
> > > > bpf_link support.
> > >
> > > I can totally see how having an option to enforce a policy such as
> > > locking out others from installing cgroup BPF programs is useful. But
> > > such an option is just that: policy. So building this policy in as a
> > > fundamental property of the API seems like a bad idea; that is
> > > effectively enforcing policy in the kernel, isn't it?
> >
> > I hope we won't go into a dictionary definition of what "policy" means
> > here :). For me it's about guarantee that kernel gives to user-space.
> > bpf_link doesn't care about dictating policies. If you don't want this
> > guarantee - don't use bpf_link, use direct program attachment. As
> > simple as that. Policy is implemented by user-space application by
> > using APIs with just the right guarantees.
> >
> > >
> > > > Those same folks have similar concern with XDP. In the world where
> > > > container management installs "root" XDP program which other user
> > > > applications can plug into (libxdp use case, right?), it's crucial to
> > > > ensure that this root XDP program is not accidentally overwritten by
> > > > some well-meaning, but not overly cautious developer experimenting in
> > > > his own container with XDP programs. This is where bpf_link ownership
> > > > plays a huge role. Tupperware agent (FB's container management agent)
> > > > would install root XDP program and will hold onto this bpf_link
> > > > without sharing it with other applications. That will guarantee that
> > > > the system will be stable and can't be compromised.
> > >
> > > See this is where we get into "deployment-model specific territory". I
> > > mean, sure, in the "central management daemon" model, it makes sense
> > > that no other applications can replace the XDP program. But, erm, we
> > > already have a mechanism to ensure that: Just don't grant those
> > > applications CAP_NET_ADMIN? So again, bpf_link doesn't really seem to
> > > add anything other than a different way to do the same thing?
> >
> > Because there are still applications that need CAP_NET_ADMIN in order
> > to function (for other reasons than attaching XDP), so it's impossible
> > to enforce with for everyone.
>
> I think I'm missing some context. CAP_NET_ADMIN is trusted by definition,
> so trust these applications to not fiddle with XDP? Are there many of these?
> Are they inside a user namespace or something?
>
> >
> > >
> > > Additionally, in the case where there is *not* a central management
> > > daemon (i.e., what I'm implementing with libxdp), this would be the flow
> > > implemented by the library without bpf_link:
> > >
> > > 1. Query kernel for current BPF prog loaded on $IFACE
> > > 2. Sanity-check that this program is a dispatcher program installed by
> > >    libxdp
> > > 3. Create a new dispatcher program with whatever changes we want to do
> > >    (such as adding another component program).
> > > 4. Atomically replace the old program with the new one using the netlink
> > >    API in this patch series.
> > >
> > > Whereas with bpf_link, it would be:
> > >
> > > 1. Find the pinned bpf_link for $IFACE (e.g., load from
> > >    /sys/fs/bpf/iface-links/$IFNAME).
> >
> > But now you can hide this mount point from containerized
> > root/CAP_NET_ADMIN application, can't you? See the difference? One
> > might think about bpf_link as a fine-grained capability in this sense.
> >
> >
> > > 2. Query kernel for current BPF prog linked to $LINK
> > > 3. Sanity-check that this program is a dispatcher program installed by
> > >    libxdp
> > > 4. Create a new dispatcher program with whatever changes we want to do
> > >    (such as adding another component program).
> > > 5. Atomically replace the old program with the new one using the
> > >    LINK_UPDATE bpf() API.
> > >
> > >
> > > So all this does is add an additional step, and another dependency on
> > > bpffs. And crucially, I really don't see how the "bpf_link is the only
> > > thing that is not fundamentally broken" argument holds up.
> > >
> > > -Toke
> > >
>
>
>
> --
> Lorenz Bauer  |  Systems Engineer
> 6th Floor, County Hall/The Riverside Building, SE1 7PB, UK
>
> www.cloudflare.com