On 10/07, Toke H�iland-J�rgensen wrote:
Daniel Borkmann <daniel@xxxxxxxxxxxxx> writes:
> On 10/7/22 1:28 AM, Alexei Starovoitov wrote:
>> On Thu, Oct 6, 2022 at 2:29 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx>
wrote:
>>> On 10/6/22 7:00 AM, Alexei Starovoitov wrote:
>>>> On Wed, Oct 05, 2022 at 01:11:34AM +0200, Daniel Borkmann wrote:
>>> [...]
>>>>
>>>> I cannot help but feel that prio logic copy-paste from old tc,
netfilter and friends
>>>> is done because "that's how things were done in the past".
>>>> imo it was a well intentioned mistake and all networking things (tc,
netfilter, etc)
>>>> copy-pasted that cumbersome and hard to use concept.
>>>> Let's throw away that baggage?
>>>> In good set of cases the bpf prog inserter cares whether the prog is
first or not.
>>>> Since the first prog returning anything but TC_NEXT will be final.
>>>> I think prog insertion flags: 'I want to run first' vs 'I don't care
about order'
>>>> is good enough in practice. Any complex scheme should probably be
programmable
>>>> as any policy should. For example in Meta we have 'xdp chainer'
logic that is similar
>>>> to libxdp chaining, but we added a feature that allows a prog to
jump over another
>>>> prog and continue the chain. Priority concept cannot express that.
>>>> Since we'd have to add some "policy program" anyway for use cases
like this
>>>> let's keep things as simple as possible?
>>>> Then maybe we can adopt this "as-simple-as-possible" to XDP hooks ?
>>>> And allow bpf progs chaining in the kernel with "run_me_first"
vs "run_me_anywhere"
>>>> in both tcx and xdp ?
>>>> Naturally "run_me_first" prog will be the only one. No need for
F_REPLACE flags, etc.
>>>> The owner of "run_me_first" will update its prog through
bpf_link_update.
>>>> "run_me_anywhere" will add to the end of the chain.
>>>> In XDP for compatibility reasons "run_me_first" will be the default.
>>>> Since only one prog can be enqueued with such flag it will match
existing single prog behavior.
>>>> Well behaving progs will use (like xdp-tcpdump or monitoring progs)
will use "run_me_anywhere".
>>>> I know it's far from covering plenty of cases that we've discussed
for long time,
>>>> but prio concept isn't really covering them either.
>>>> We've struggled enough with single xdp prog, so certainly not
advocating for that.
>>>> Another alternative is to do: "queue_at_head" vs "queue_at_tail".
Just as simple.
>>>> Both simple versions have their pros and cons and don't cover
everything,
>>>> but imo both are better than prio.
>>>
>>> Yeah, it's kind of tricky, imho. The 'run_me_first'
vs 'run_me_anywhere' are two
>>> use cases that should be covered (and actually we kind of do this in
this set, too,
>>> with the prios via prio=x vs prio=0). Given users will only be
consuming the APIs
>>> via libs like libbpf, this can also be abstracted this way w/o users
having to be
>>> aware of prios.
>>
>> but the patchset tells different story.
>> Prio gets exposed everywhere in uapi all the way to bpftool
>> when it's right there for users to understand.
>> And that's the main problem with it.
>> The user don't want to and don't need to be aware of it,
>> but uapi forces them to pick the priority.
>>
>>> Anyway, where it gets tricky would be when things depend on ordering,
>>> e.g. you have BPF progs doing: policy, monitoring, lb, monitoring,
encryption, which
>>> would be sth you can build today via tc BPF: so policy one acts as a
prefilter for
>>> various cidr ranges that should be blocked no matter what, then
monitoring to sample
>>> what goes into the lb, then lb itself which does snat/dnat, then
monitoring to see what
>>> the corresponding pkt looks that goes to backend, and maybe
encryption to e.g. send
>>> the result to wireguard dev, so it's encrypted from lb node to
backend.
>>
>> That's all theory. Your cover letter example proves that in
>> real life different service pick the same priority.
>> They simply don't know any better.
>> prio is an unnecessary magic that apps _have_ to pick,
>> so they just copy-paste and everyone ends up using the same.
>>
>>> For such
>>> example, you'd need prios as the 'run_me_anywhere' doesn't guarantee
order, so there's
>>> a case for both scenarios (concrete layout vs loose one), and for
latter we could
>>> start off with and internal prio around x (e.g. 16k), so there's room
to attach in
>>> front via fixed prio, but also append to end for 'don't care', and
that could be
>>> from lib pov the default/main API whereas prio would be some kind of
extended one.
>>> Thoughts?
>>
>> If prio was not part of uapi, like kernel internal somehow,
>> and there was a user space daemon, systemd, or another bpf prog,
>> module, whatever that users would interface to then
>> the proposed implementation of prio would totally make sense.
>> prio as uapi is not that.
>
> A good analogy to this issue might be systemd's unit files.. you
specify dependencies
> for your own <unit> file via 'Wants=<unitA>', and ordering
via 'Before=<unitB>' and
> 'After=<unitC>' and they refer to other unit files. I think that is
generally okay,
> you don't deal with prio numbers, but rather some kind textual
representation. However
> user/operator will have to deal with dependencies/ordering one way or
another, the
> problem here is that we deal with kernel and loader talks to kernel
directly so it
> has no awareness of what else is running or could be running, so apps
needs to deal
> with it somehow (and it cannot without external help).
I was thinking a little about how this might work; i.e., how can the
kernel expose the required knobs to allow a system policy to be
implemented without program loading having to talk to anything other
than the syscall API?
How about we only expose prepend/append in the prog attach UAPI, and
then have a kernel function that does the sorting like:
int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs, struct
bpf_prog *new_prog, bool append)
where the default implementation just appends/prepends to the array in
progs depending on the value of 'appen'.
And then use the __weak linking trick (or maybe struct_ops with a member
for TXC, another for XDP, etc?) to allow BPF to override the function
wholesale and implement whatever ordering it wants? I.e., allow it can
to just shift around the order of progs in the 'progs' array whenever a
program is loaded/unloaded?
This way, a userspace daemon can implement any policy it wants by just
attaching to that hook, and keeping things like how to express
dependencies as a userspace concern?
What if we do the above, but instead of simple global 'attach first/last',
the default api would be:
- attach before <target_fd>
- attach after <target_fd>
- attach before target_fd=-1 == first
- attach after target_fd=-1 == last
?
That might be flexible enough by default to allow users to
append/prepend to any existing program in the chain (say, for
monitoring). Flexible enough for some central daemons to do
systemd-style policy. And, with bpf_add_new_tcx_prog, flexible
enough to implement any policy?