Sending as another RFC even though the patches are unchanged vs. the last
iteration, to provide background/context ahead of bpf office hours on
Oct 6th; netdev@ and nf-devel@ are deliberately omitted.

This series adds a bpf program generator for netfilter base hooks.

'netfilter base hooks' are C functions that get called from the NF_HOOK()
stubs found in many locations in the network stack.

Examples from ipv4 (ip_input.c):

	return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN,
		       net, NULL, skb, skb->dev, NULL,
		       ip_local_deliver_finish);
[..]
	return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
		       net, NULL, skb, dev, NULL,
		       ip_rcv_finish);

Well-known users of this facility are iptables and nftables, but also
connection tracking and selinux.

Conntrack is also a greedy module, with 4 hooks total (prerouting, input,
output, postrouting) and another two via the nf_defrag(_ipv4) module
dependency.

Eliding the static-key handling, NF_HOOK() expands to:

-----
struct nf_hook_entries *hooks = rcu_dereference(net->nf.hooks_ipv4[hook]);
/* where 'hook' is any one of prerouting, input, and so on */

ret = nf_hook_slow(skb, &state, hooks, 0);
if (ret == 1) /* packet is allowed to pass */
	okfn(net, sk, skb);
-----

'hooks' is an array of function-address/void *arg pairs that is
iterated in nf_hook_slow():

for i in hooks[]; do
  verdict = hooks[i].addr(hooks[i].arg, skb, state);
  switch (verdict) { ....

Each hook can choose to toss the packet (NF_DROP), move to the next hook
(NF_ACCEPT), assume skb ownership (NF_STOLEN), and so on.

All hooks have access to the skb, to the private void *arg (used by
nf_tables and ip_tables: it points to the start of the user-defined
ruleset to evaluate) and to a context structure that wraps extra data:
incoming and outgoing network interfaces, the net namespace the hook is
registered in, the protocol family, the hook location (input, prerouting,
forward, ...) and so on.

Even for a simple iptables-filter + nat setup this results in multiple
indirect calls per packet.
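To make the per-packet cost concrete, here is a minimal user-space sketch
of the iteration described above. It is a model, not kernel code: the
names (struct pkt, struct hook_state, hook_slow, the V_* verdicts and the
two example hooks) are hypothetical stand-ins, and the return convention
is simplified to 1 = pass, 0 = stolen, -1 = dropped.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel verdicts and types. */
enum verdict { V_DROP = 0, V_ACCEPT = 1, V_STOLEN = 2 };

struct pkt { int len; };             /* stands in for struct sk_buff */
struct hook_state { int hooknum; };  /* stands in for the hook context */

typedef enum verdict (*hook_fn)(void *priv, struct pkt *p,
				const struct hook_state *st);

/* One function-address/void *arg pair, as in the 'hooks' array. */
struct hook_entry { hook_fn addr; void *arg; };

struct hook_entries {
	size_t num;
	struct hook_entry hooks[8];
};

/* Walk all hooks; every iteration is one indirect call. */
static int hook_slow(struct pkt *p, const struct hook_state *st,
		     const struct hook_entries *e)
{
	for (size_t i = 0; i < e->num; i++) {
		enum verdict v = e->hooks[i].addr(e->hooks[i].arg, p, st);

		switch (v) {
		case V_ACCEPT:
			continue;	/* move on to the next hook */
		case V_STOLEN:
			return 0;	/* hook took ownership of the packet */
		case V_DROP:
		default:
			return -1;	/* packet is discarded */
		}
	}
	return 1;			/* packet is allowed to pass */
}

/* Example hook: counts packets via its private argument. */
static enum verdict count_hook(void *priv, struct pkt *p,
			       const struct hook_state *st)
{
	(void)p; (void)st;
	++*(int *)priv;
	return V_ACCEPT;
}

/* Example hook: drops packets shorter than the configured minimum. */
static enum verdict min_len_hook(void *priv, struct pkt *p,
				 const struct hook_state *st)
{
	(void)st;
	return p->len >= *(int *)priv ? V_ACCEPT : V_DROP;
}
```

With count_hook and min_len_hook registered in that order, a packet that
passes costs two indirect calls; a real ruleset with conntrack, filter
and nat adds more of them at each hook location.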
The proposed autogenerator unrolls nf_hook_slow() and builds a bpf program
that performs those function calls sequentially, i.e.:

	state->priv = hooks[0].hook_arg;
	v = firstfunction(state);
	if (v != ACCEPT) goto out;
	state->priv = hooks[1].hook_arg;
	v = secondfunction(state);
	if (v != ACCEPT) goto out;
	...

and so on.

As the function arguments are still taken from struct net at runtime,
rather than added as constants, those programs can be shared across net
namespaces if they share the exact same registered hooks.
(Example: 10 netns with an iptables-filter table and active conntrack will
all share the same 5 programs (one each for prerouting, input, forward,
output and postrouting), rather than using 50 bpf programs.)

Invocation of the autogenerated programs is done via the bpf dispatcher
from nf_hook(); instead of

	ret = nf_hook_slow( ... )

this is now:

------------------
struct bpf_prog *prog = READ_ONCE(e->hook_prog);

state.priv = (void *)e;
state.skb = skb;

migrate_disable();
ret = __bpf_prog_run(prog, &state, BPF_DISPATCHER_FUNC(nf_hook_base));
migrate_enable();
------------------

As long as NF_QUEUE is not used -- which should be rare -- the data path
will not call the nf_hook_slow "interpreter" anymore.

There are no changes to the BPF core and no UAPI additions, although I
suppose it would make sense to add an 'enable/disable' sysctl for this.

I think that it makes little sense to consider any form of nf_tables
(or iptables) JIT without indirect-call avoidance first, unless such a
'jit' would be for the XDP hook.

I would propose an 'xdptables' tool for that though (or an 'xdp' family
for nftables), without kernel changes.

Comments welcome.
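The sharing rule described above (a program depends only on which hook
functions are registered and in what order, since their arguments are
fetched at runtime) can be modelled with a small lookup keyed on the
function sequence. This is a user-space sketch under assumptions, not
the cache from the series: prog_cache, prog_cache_get, CACHE_SLOTS and
the two example hooks are hypothetical names, and the returned slot
index stands in for a generated program.

```c
#include <stddef.h>

/* One-argument hook function, matching the unrolled calls above. */
typedef int (*hook_fn)(void *state);

#define MAX_HOOKS   32	/* the series limits the hook count to 32 */
#define CACHE_SLOTS 16	/* arbitrary size for this sketch */

/* Key: the ordered list of hook functions, without their arguments. */
struct prog_key {
	size_t num;
	hook_fn fns[MAX_HOOKS];
};

struct prog_cache {
	size_t used;
	struct prog_key keys[CACHE_SLOTS];
};

static int key_matches(const struct prog_key *k, const hook_fn *fns,
		       size_t num)
{
	if (k->num != num)
		return 0;
	for (size_t i = 0; i < num; i++)
		if (k->fns[i] != fns[i])
			return 0;
	return 1;
}

/*
 * Return the slot of an existing "program" for this function sequence,
 * or allocate a new slot. Returns -1 if the request cannot be served.
 */
static int prog_cache_get(struct prog_cache *c, const hook_fn *fns,
			  size_t num)
{
	if (num > MAX_HOOKS)
		return -1;

	for (size_t i = 0; i < c->used; i++)
		if (key_matches(&c->keys[i], fns, num))
			return (int)i;	/* shared with an earlier user */

	if (c->used == CACHE_SLOTS)
		return -1;

	c->keys[c->used].num = num;
	for (size_t i = 0; i < num; i++)
		c->keys[c->used].fns[i] = fns[i];
	return (int)c->used++;
}

/* Example hooks; both simply accept. */
static int hook_a(void *state) { (void)state; return 1; }
static int hook_b(void *state) { (void)state; return 1; }
```

Two namespaces that register hook_a followed by hook_b get the same
slot even if their private arguments differ, while a namespace with the
reverse order gets a different one, because the call order is part of
the unrolled program.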
Florian Westphal (9):
  netfilter: nf_queue: carry index in hook state
  netfilter: nat: split nat hook iteration into a helper
  netfilter: remove hook index from nf_hook_slow arguments
  netfilter: make hook functions accept only one argument
  netfilter: reduce allowed hook count to 32
  netfilter: add bpf base hook program generator
  netfilter: core: do not rebuild bpf program on dying netns
  netfilter: netdev: switch to invocation via bpf
  netfilter: hook_jit: add prog cache

 drivers/net/ipvlan/ipvlan_l3s.c            |   4 +-
 include/linux/netfilter.h                  |  82 ++-
 include/linux/netfilter_arp/arp_tables.h   |   3 +-
 include/linux/netfilter_bridge/ebtables.h  |   3 +-
 include/linux/netfilter_ipv4/ip_tables.h   |   4 +-
 include/linux/netfilter_ipv6/ip6_tables.h  |   3 +-
 include/linux/netfilter_netdev.h           |  33 +-
 include/net/netfilter/br_netfilter.h       |   7 +-
 include/net/netfilter/nf_flow_table.h      |   6 +-
 include/net/netfilter/nf_hook_bpf.h        |  21 +
 include/net/netfilter/nf_queue.h           |   3 +-
 include/net/netfilter/nf_synproxy.h        |   6 +-
 net/bridge/br_input.c                      |   3 +-
 net/bridge/br_netfilter_hooks.c            |  30 +-
 net/bridge/br_netfilter_ipv6.c             |   5 +-
 net/bridge/netfilter/ebtable_broute.c      |   9 +-
 net/bridge/netfilter/ebtables.c            |   6 +-
 net/bridge/netfilter/nf_conntrack_bridge.c |   8 +-
 net/ipv4/netfilter/arp_tables.c            |   7 +-
 net/ipv4/netfilter/ip_tables.c             |   7 +-
 net/ipv4/netfilter/ipt_CLUSTERIP.c         |   6 +-
 net/ipv4/netfilter/iptable_mangle.c        |  15 +-
 net/ipv4/netfilter/nf_defrag_ipv4.c        |   5 +-
 net/ipv6/ila/ila_xlat.c                    |   6 +-
 net/ipv6/netfilter/ip6_tables.c            |   6 +-
 net/ipv6/netfilter/ip6table_mangle.c       |  13 +-
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c  |   5 +-
 net/netfilter/Kconfig                      |  10 +
 net/netfilter/Makefile                     |   1 +
 net/netfilter/core.c                       | 121 ++++-
 net/netfilter/ipvs/ip_vs_core.c            |  13 +-
 net/netfilter/nf_conntrack_proto.c         |  34 +-
 net/netfilter/nf_flow_table_inet.c         |   8 +-
 net/netfilter/nf_flow_table_ip.c           |  12 +-
 net/netfilter/nf_hook_bpf.c                | 574 +++++++++++++++++++++
 net/netfilter/nf_nat_core.c                |  50 +-
 net/netfilter/nf_nat_proto.c               |  56 +-
 net/netfilter/nf_queue.c                   |  12 +-
 net/netfilter/nf_synproxy_core.c           |   8 +-
 net/netfilter/nft_chain_filter.c           |  48 +-
 net/netfilter/nft_chain_nat.c              |   7 +-
 net/netfilter/nft_chain_route.c            |  22 +-
 security/apparmor/lsm.c                    |   5 +-
 security/selinux/hooks.c                   |  22 +-
 security/smack/smack_netfilter.c           |   8 +-
 45 files changed, 1044 insertions(+), 273 deletions(-)
 create mode 100644 include/net/netfilter/nf_hook_bpf.h
 create mode 100644 net/netfilter/nf_hook_bpf.c

-- 
2.35.1