Re: [RFC] bpf: add bpf_link support for BPF_NETFILTER programs

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Mon, 30 Jan 2023 13:44:42 -0800

On Mon, Jan 30, 2023 at 07:01:15PM +0100, Florian Westphal wrote:
> Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> > > Is BPF_LINK the right place?  Hook gets removed automatically if the calling program
> > > exits, afaict this is intended.

Yes. bpf_link is the right model.
I'd also allow more than one BPF_NETFILTER prog at the hook.
When Daniel respins his tc bpf_link set there will be a way to do that
for tc and hopefully soon for xdp.
For netfilter hook we can use the same approach.

> > 
> > Yes, this is indeed intended for bpf_link. This plays well with
> > applications that use the API and stick around (because things get
> > cleaned up after them automatically even if they crash, say), but it
> > doesn't work so well for programs that don't (which, notably, includes
> > command line utilities like 'nft').
> 
> Right, but I did not want to create a dependency on nfnetlink or
> nftables netlink right from the start.
> 
> > For XDP and TC users can choose between bpf_link and netlink for
> > attachment and get one of the two semantics (goes away on close or stays
> > put). Not sure if it would make sense to do the same for nftables?
> 
> For nftables I suspect that, if nft can emit bpf, it would make sense to
> pass the bpf descriptor together with nftables netlink, i.e. along with
> the normal netlink data.
> 
> nftables kernel side would then know to use the bpf prog for the
> datapath instead of the interpreter and could even fallback to
> interpreter.
> 
> But for the raw hook use case that Alexei and Daniel preferred (cf.
> initial proposal to call bpf progs from nf_tables interpreter) I think
> that there should be no nftables dependency.
> 
> I could add a new nfnetlink subtype for nf-bpf if bpf_link is not
> appropriate as an alternative.

Let's start with bpf_link and figure out netlink path when appropriate.

> > > Should a program running in init_netns be allowed to attach hooks in other netns too?
> > >
> > > I could do what BPF_LINK_TYPE_NETNS is doing and fetch net via
> > > get_net_ns_by_fd(attr->link_create.target_fd);
> > 
> > We don't allow that for any other type of BPF program; the expectation
> > is that the entity doing the attachment will move to the right ns first.
> > Is there any particular use case for doing something different for
> > nftables?
> 
> Nope, not at all.  I was just curious.  So no special handling for
> init_netns needed, good.
> 
> > > For the actual BPF_NETFILTER program type I plan to follow what the bpf
> > > flow dissector is doing, i.e. pretend prototype is
> > >
> > > func(struct __sk_buff *skb)
> > >
> > > but pass a custom program specific context struct on kernel side.
> > > Verifier will rewrite accesses as needed.
> > 
> > This sounds reasonable, and also promotes code reuse between program
> > types (say, you can write some BPF code to parse a packet and reuse it
> > between the flow dissector, TC and netfilter).
> 
> Ok, thanks.
> 
> > > Things like nf_hook_state->in (net_device) could then be exposed via
> > > kfuncs.
> > 
> > Right, so like:
> > 
> > state = bpf_nf_get_hook_state(ctx); ?
> > 
> > Sounds OK to me.
> 
> Yes, something like that.  Downside is that the nf_hook_state struct
> is no longer bound by any abi contract, but as I understood it thats
> fine.

I'd steer clear from new abi-s.
Don't look at uapi __sk_buff model. It's not a great example to follow.
Just pass kernel nf_hook_state into bpf prog and let program deal
with changes to it via CORE.
The prog will get a defition of 'struct nf_hook_state' from vmlinux.h
or via private 'struct nf_hook_state___flavor' with few fields defined
that prog wants to use. CORE will deal with offset adjustments.
That's a lot less kernel code. No need for asm style ctx rewrites.
Just see how much kernel code we already burned on *convert_ctx_access().
We cannot remove this tech debt due to uapi.
When you pass struct nf_hook_state directly none of it is needed.