Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 3/3/20 9:24 PM, Toke Høiland-Jørgensen wrote:
Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes:
On Tue, Mar 3, 2020 at 11:23 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
On 3/3/20 4:46 PM, Alexei Starovoitov wrote:
On 3/3/20 12:12 AM, Daniel Borkmann wrote:

I can see the motivation for this abstraction in particular for tracing, but given
the goal of bpf_link is to formalize and make the various program attachment types
more uniform, how is this going to solve e.g. the tc/BPF case? There is no guarantee
that while you create a link with the prog attached to cls_bpf that someone else is
going to replace that qdisc underneath you, and hence you end up with the same case
as if you would have only pinned the program itself (and not a link). So bpf_link
then gives a wrong impression that something is still attached and active while it
is not. What is the plan for these types?

TC is not easy to handle, right, but I don't see a 'wrong impression' part. The link will keep the program attached to qdisc. The admin
may try to remove qdisc for netdev, but that's a separate issue.
Same thing with xdp. The link will keep xdp program attached,
but admin may do ifconfig down and no packets will be flowing.
Similar with cgroups. The link will keep prog attached to a cgroup,
but admin can still do rmdir and cgroup will be in 'dying' state.
In case of tracing there is no intermediate entity between programs
and the kernel. In case of networking there are layers.
Netdevs, qdiscs, etc. May be dev_hold is a way to go.

Yep, right. I mean taking tracing use-case aside, in Cilium we attach to XDP, tc,
cgroups BPF and whatnot, and we can tear down the Cilium user space agent just
fine while packets keep flowing through the BPF progs, and a later restart will
just reattach them atomically, e.g. Cilium version upgrades are usually done this
way.

Right. This is the case where you want attached BPF program to survive
control application process exiting. Which is not a safe default,
though, because it might lead to BPF program running without anyone
knowing, leading to really bad consequences. It's especially important
for applications that are deployed fleet-wide and that don't "control"
hosts they are deployed to. If such application crashes and no one
notices and does anything about that, BPF program will keep running
draining resources or even just, say, dropping packets. We at FB had
outages due to such permanent BPF attachment semantics. With FD-based

I think it depends on the environment, and yes, whether the orchestrator
of those progs controls the host [networking] as in case of Cilium. We
actually had cases where a large user in prod was accidentally removing
the Cilium k8s daemon set (and hence the user space agent as well) and only
noticed 1hrs later since everything just kept running in the data path as
expected w/o causing them an outage. So I think both attachment semantics
have pros and cons. ;)

bpf_link we are getting a framework, which allows safe,
auto-detachable behavior by default, unless application explicitly
opts in w/ bpf_link__pin().

This decoupling works since the attach point is already holding the reference on
the program, and if needed user space can always retrieve what has been attached
there. So the surrounding object acts like the "bpf_link" already. I think we need
to figure out what semantics an actual bpf_link should have there. Given an admin
can change qdisc/netdev/etc underneath us, and hence cause implicit detachment, I
don't know whether it would make much sense to keep surrounding objects like filter,
qdisc or even netdev alive to work around it since there's a whole dependency chain,
like in case of filter instance, it would be kept alive, but surrounding qdisc may
be dropped.

I don't have specific enough knowledge right now to answer tc/BPF
question, but it seems like attached BPF program should hold a
reference to whatever it's attached to (net_device or whatnot) and not
let it just disappear? E.g., for cgroups, cgroup will go into dying

But then are you also expecting that netlink requests which drop that tc
filter that holds this BPF prog would get rejected given it has a bpf_link,
is active & pinned and traffic goes through? If not the case, then what
would be the point? If it is the case, then this seems rather complex to
realize for rather little gain given there are two uapi interfaces (bpf,
tc/netlink) which then mess around with the same underlying object in
different ways.

state, but it still will be there as long as there are remaining BPF
programs attached, sockets open, etc. I think it should be a general
approach, but again, I don't know specifics of each "attach point".

Question is, if there are no good semantics and benefits over what can be done
today with existing infra (abstracted from user space via libbpf) for the remaining
program types, perhaps it makes sense to have the pinning tracing specific only
instead of generic abstraction which only ever works for a limited number?

See above, I think bpf_link is what allows to have both
auto-detachment by default, as well as allow long-lived BPF
attachments (with explicit opt int).

As for what bpf_link can provide on top of existing stuff. One thing
that becomes more apparent with recent XDP discussions and what was
solved in cgroup-specific way for cgroup BPFs, is that there is a need
to swap BPF programs without interruption (BPF_F_REPLACE behavior for
cgroup BPF). Similar semantics is desirable for XDP, it seems. That's
where bpf_link is useful. Once bpf_link is attached (for specificity,
let's say XDP program to some ifindex), it cannot be replaced with
other bpf_link. Attached bpf_link will need to be detached first (by
means of closing all open FDs) to it. This ensures no-one can
accidentally replace XDP dispatcher program.

Now, once you have bpf_link attached, there will be bpf_link operation
(e.g., BPF_LINK_SWAP or something like that), where underlying BPF
program, associated with bpf_link, will get replaced with a new BPF
program without an interruption. Optionally, we can provide
expected_bpf_program_fd to make sure we are replacing the right
program (for cases where could be few bpf_link owners trying to modify
bpf_link, like in libxdp case). So in that sense bpf_link is a
coordination point, which mediates access to BPF hook (resource).

Thoughts?

I can see how the bpf_link abstraction helps by providing a single
abstraction for all the tracing-type attachments that are fd-based
anyway; but I think I agree with Daniel that maybe it makes more sense
to keep it to those? I.e., I'm not sure what bpf_link adds to XDP
program attachment? The expected_prev_fd field to replace a program
could just as well be provided by extending the existing netlink API?

-Toke





[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux