netfilter+bpf road ahead

Florian Westphal <fw@xxxxxxxxx> · Tue, 18 Oct 2022 20:26:37 +0200

Hello,

This is a summary of what Alexei Starovoitov and myself talked about
in our meeting in Zurich.  Most of this was written by Alexei, with
minor edits and additions from me.

- Alexei and Florian met in Zurich to discuss netfilter and bpf.
  netfilter (core, ipables, ebtables, nftables ...) all take heavy
  performance hits on retpoline enabled kernels due to
  indiscriminate use of indirect calls.
  Over the years nftables grew a large number of workarounds to keep
acceptable performance for common case.
  In few places indirect calls were replaced with large if (tgt ==
&fn1) fn1(); else if (tgt == &fn2) fn2(); else ...
  [link1](https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/netfilter/nf_tables_core.c#n256)
  [link2](https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/netfilter/nf_tables_core.c#n198)
  In other place a set of giant switch statements were used.
  [link](https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/netfilter/nft_meta.c#n309)
  The 3rd bottleneck couldn't be done with either if-s or switch and
Florian proposed to accelerate it with [generated bpf
code](https://lore.kernel.org/bpf/20221005141309.31758-1-fw@xxxxxxxxx/).
  The NFT VM wasn't flexible enough either. Despite large engineering
investment it still lacks some abilities that are needed for feature
parity with iptables.
RHEL and Fedora changed the iptables default to iptables-nft, but
iptables-nft implements feature parity with ipables by calling into the
x_tables modules.  This needs two indirect calls for each match
(call to nft_compat expression, then call to the xtables target or
 match function).
iptables-nft can be changed gradually to replace matches with nft-native
expressions to avoid this.
But in some cases the modules/targets have a feature that
cannot be emulated with the existing nft vm.

One example is ability to only store parts of skb->mark.
The nft grammar would allow to do:

ct mark set (ct mark & 0xffffff00) | (meta mark & 0xff)

... which would stash the lower 8bit of skb->mark while keeping
the upper 24 bits of the connmark intact.

But neither frontend or backend (kernel) can handle it, because it needs
support for:

  regA = regB | regC

nf_tables only allows
  regA = regB BINOP VALUE.

In the example given above, the problem is the right hand side
of the OR -- its not a constant value.

ct mark set (ct mark & 0xffffff00) | 1

... would work.

Patches that allow two source registers are floating around on mailing
list but have not been applied so far.

Some customers use xt_bpf with either classic_bf or ebpf, so Florian proposed
nft->ebpf, but Daniel Borkmann and Alexei argued against.
The key promise of NFT was flexible packet parsing. Turns out that there
are users that would benefit from programmable parsing, e.g. to extract
sni from certificates or hostnames from DNS replies.

After many hours of brainstorming we came up with the plan:
 - cleanup and land bpf generator to accelerate one of nf bottlenecks.
 - introduce new stable BPF_PROG_TYPE_NETFILTER. Alexeis preference
was to avoid new prog types and use unstable hooks,
but iptables are scoped by network namespaces. We could use
xdp_dispatcher-like generator to demux bpf prog per netns,
but netns removal automatically flushes iptable rules, so netns
would need to know about this bpf dispatcher and unload
bpf-netfilter prog. At that point the amount of user facing
"implementation details" becomes so large that calling
such hooks "unstable" isn't realistic.
- return values from this prog type will be existing netfilter codes
except NF_STOLEN.
- allow BPF_PROG_TYPE_NETFILTER to attach to all netfilter/iptables
hooks where program context will be uapi 'struct bpf_netfilter'
At that point the stable part of the interface ends. From input
context the program will be able to access skb, socket, nents, netdev
pointers and read them with the help of CO-RE and BTF.
- attach uapi will be done either with bpf_link and FD or with
netlink using a tuple (netns, nf_family, nf_hook, bpf_prog)
- introduce a set of kfuncs to access conntrack, nat, nft sets and maps,
nf_queue and so on.
- in addition to existing two iptables rules converters in user
space (iptables->nft text-to-text and iptables->nft text-to-netlink)
the latter will be augmented to generate BPF_PROG_TYPE_NETFILTER
prog as well.  bpf-aware nft frontend would pass both the nft instructions
(for netfilter monitor and netlink query purposes) and a bpf_prog, but will execute
bpf program in run-time.
The bpf prog doesn't have to have 100% feature parity. It can fall
back to NFT core for not-yet-implemented expressions.
- nft_set_pipapo.c is an efficient classification map for arbitrary ranges
represented as 'nft set' from uapi pov.
bpf side might interface to it directly via kfuncs.
- lots of details to be figured out, but if netfilter core folks
agree to this plan it will be one of the most exciting
projects in the linux networking. iptables will see significant
performance boost and major feature addition.
Blending bpf and netfilter worlds would be fantastic.

Florian will rework the last RFC patchset and will re-run
benchmarks with both RETPOLINE=n|y, results should be available
mid-november-ish.