Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API

Jamal Hadi Salim <jhs@xxxxxxxxxxxx> · Wed, 16 Jun 2021 11:33:38 -0400

On 2021-06-15 7:44 p.m., Daniel Borkmann wrote:
On 6/15/21 1:54 PM, Toke Høiland-Jørgensen wrote:
Cong Wang <xiyou.wangcong@xxxxxxxxx> writes:
[...]

Just to keep on brainstorming here, I wanted to bring back Alexei's 
earlier quote:

   > I think it makes sense to create these objects as part of 
establishing bpf_link.
   > ingress qdisc is a fake qdisc anyway.
   > If we could go back in time I would argue that its existence doesn't
   > need to be shown in iproute2. It's an object that serves no purpose
   > other than attaching filters to it. It doesn't do any queuing unlike
   > real qdiscs.
   > It's an artifact of old choices. Old doesn't mean good.
   > The kernel is full of such quirks and oddities. New api-s shouldn't
   > blindly follow them.
   > tc qdisc add dev eth0 clsact
   > is a useless command with nop effect.

I am not sure what Alexei's statement about old vs good was getting at.
You have to have  hooks/locations to stick things. Does it matter what
you call that hook?

The whole bpf_link in this context feels somewhat awkward because both 
are two
different worlds, one accessible via netlink with its own lifetime etc, 
the other
one tied to fds and bpf syscall. Back in the days we did the cls_bpf 
integration
since it felt the most natural at that time and it had support for both 
the ingress
and egress side, along with the direct action support which was added 
later to have
a proper fast path for BPF. One thing that I personally never liked is 
that later
on tc sadly became a complex, quirky dumping ground for all the nic hw 
offloads (I
guess mainly driven from ovs side) for which I have a hard time 
convincing myself
that this is used at scale in production. Stuff like af699626ee26 just 
to pick one
which annoyingly also adds to the fast path given distros will just 
compile in most
of these things (like NET_TC_SKB_EXT)... what if such bpf_link object is 
not tied
at all to cls_bpf or cls_act qdisc, and instead would implement the 
tcf_classify_
{egress,ingress}() as-is in that sense, similar like the bpf_lsm hooks. 

The choice is between generic architecture and appliance
only-what-you-need code (via ebpf).
Dont disagree that at times patches go in at the expense of the kernel
datapath complexity or cost. Unfortunately sometimes this is because
theres no sufficient review time - but thats a different topic.
We try to impose a rule which states that any hardware offload has
to have a kernel/software twin. Often that helps contain things.

Meaning,
you could run existing tc BPF prog without any modifications and without 
additional
extra overhead (no need to walk the clsact qdisc and then again into the 
cls_bpf
one). These tc BPF programs would be managed only from bpf() via tc 
bpf_link api,
and are otherwise not bothering to classic tc command (though they could 
be dumped
there as well for sake of visibility, though bpftool would be fitting 
too). However,
if there is something attached from classic tc side, it would also go 
into the old
style tcf_classify_ingress() implementation and walk whatever is there 
so that nothing
existing breaks (same as when no bpf_link would be present so that there 
is no extra
overhead). This would also allow for a migration path of multi prog from 
cls_bpf to
this new implementation. Details still tbd, but I would much rather like 
such an
approach than the current discussed one, and it would also fit better 
given we don't
run into this current mismatch of both worlds.

The danger is totally divorcing from tc when you have speacial
cases just for ebpf/tc i.e this is no different than what the
hardware offload making you unhappy. The ability to use existing
tools (user space tc in this case) to inter-work on both is
very useful.

From the discussion on the control aspect with Kartikeya i
understood that we need some "transient state" which needs to get
created and stored somewhere before being applied to tc (example
creating the filters first and all necessary artificats then calling
internally to cls api).
Seems to me that the "transient state" belongs to bpf. And i understood
Kartikeya this was his design intent as well (which seems sane to me).

cheers,
jamal