I got around to re-running the flood benchmarks. Mainly to confirm that the
introduction of the static key had the desired effect - users not attaching
BPF sk_lookup programs won't notice a performance hit in Linux v5.9.

But also to check for any unexpected bottlenecks when a BPF sk_lookup
program is attached, like the struct in6_addr copying that turned out to be
a bad idea in v1.

The test setup has already been covered in the cover letter for v1 of the
series so I'm not going to repeat it here. Please take a look at
"Performance considerations" in [0].

BPF program [1] used during benchmarks has been updated to work with the
BPF sk_lookup uAPI in v5.

RX pps and CPU cycles events were recorded in 3 configurations:

 1. 5.8-rc7 w/o this BPF sk_lookup patch series (baseline),
 2. 5.8-rc7 with patches applied, but no SK_LOOKUP program attached,
 3. 5.8-rc7 with patches applied, and SK_LOOKUP program attached;
    BPF program [1] does a lookup in an LPM_TRIE map with 200 entries.

RX pps measured with `ifpps -d <dev> -t 1000 --csv --loop` for 60 sec.

| tcp4 SYN flood               | rx pps (mean ± sstdev) | Δ rx pps |
|------------------------------+------------------------+----------|
| 5.8-rc7 vanilla (baseline)   | 899,875 ± 1.0%         |        - |
| no SK_LOOKUP prog attached   | 889,798 ± 0.6%         |    -1.1% |
| with SK_LOOKUP prog attached | 868,885 ± 1.4%         |    -3.4% |

| tcp6 SYN flood               | rx pps (mean ± sstdev) | Δ rx pps |
|------------------------------+------------------------+----------|
| 5.8-rc7 vanilla (baseline)   | 823,364 ± 0.6%         |        - |
| no SK_LOOKUP prog attached   | 832,667 ± 0.7%         |     1.1% |
| with SK_LOOKUP prog attached | 820,505 ± 0.4%         |    -0.3% |

| udp4 0-len flood             | rx pps (mean ± sstdev) | Δ rx pps |
|------------------------------+------------------------+----------|
| 5.8-rc7 vanilla (baseline)   | 2,486,313 ± 1.2%       |        - |
| no SK_LOOKUP prog attached   | 2,486,932 ± 0.4%       |     0.0% |
| with SK_LOOKUP prog attached | 2,340,425 ± 1.6%       |    -5.9% |

| udp6 0-len flood             | rx pps (mean ± sstdev) | Δ rx pps |
|------------------------------+------------------------+----------|
| 5.8-rc7 vanilla (baseline)   | 2,505,270 ± 1.3%       |        - |
| no SK_LOOKUP prog attached   | 2,522,286 ± 1.3%       |     0.7% |
| with SK_LOOKUP prog attached | 2,418,737 ± 1.3%       |    -3.5% |

cpu-cycles measured with `perf record -F 999 --cpu 1-4 -g -- sleep 60`.

|                              | cpu-cycles events      |          |
| tcp4 SYN flood               | __inet_lookup_listener | Δ events |
|------------------------------+------------------------+----------|
| 5.8-rc7 vanilla (baseline)   |                  1.31% |        - |
| no SK_LOOKUP prog attached   |                  1.24% |    -0.1% |
| with SK_LOOKUP prog attached |                  2.59% |     1.3% |

|                              | cpu-cycles events      |          |
| tcp6 SYN flood               | inet6_lookup_listener  | Δ events |
|------------------------------+------------------------+----------|
| 5.8-rc7 vanilla (baseline)   |                  1.28% |        - |
| no SK_LOOKUP prog attached   |                  1.22% |    -0.1% |
| with SK_LOOKUP prog attached |                  3.15% |     1.4% |

|                              | cpu-cycles events      |          |
| udp4 0-len flood             | __udp4_lib_lookup      | Δ events |
|------------------------------+------------------------+----------|
| 5.8-rc7 vanilla (baseline)   |                  3.70% |        - |
| no SK_LOOKUP prog attached   |                  4.13% |     0.4% |
| with SK_LOOKUP prog attached |                  7.55% |     3.9% |

|                              | cpu-cycles events      |          |
| udp6 0-len flood             | __udp6_lib_lookup      | Δ events |
|------------------------------+------------------------+----------|
| 5.8-rc7 vanilla (baseline)   |                  4.94% |        - |
| no SK_LOOKUP prog attached   |                  4.32% |    -0.6% |
| with SK_LOOKUP prog attached |                  8.07% |     3.1% |

A couple of comments:

1. udp6 outperformed udp4 in our setup. The likely suspect is
   CONFIG_IP_FIB_TRIE_STATS, which put fib_table_lookup at the top of the
   perf report when it comes to cpu-cycles w/o counting children. It
   should have been disabled.

2. When a BPF sk_lookup program is attached, the hot spot remains copying
   data to populate the BPF context object before each program run.

   For example, snippet from perf annotate for __udp4_lib_lookup:

---8<---
         :                      rcu_read_lock();
         :                      run_array = rcu_dereference(net->bpf.run_array[NETNS_BPF_SK_LOOKUP]);
    0.01 :   ffffffff817f8624:       mov    0xd68(%r12),%rsi
         :                      if (run_array) {
    0.00 :   ffffffff817f862c:       test   %rsi,%rsi
    0.00 :   ffffffff817f862f:       je     ffffffff817f87a9 <__udp4_lib_lookup+0x2c9>
         :                      struct bpf_sk_lookup_kern ctx = {
    1.05 :   ffffffff817f8635:       xor    %eax,%eax
    0.00 :   ffffffff817f8637:       mov    $0x6,%ecx
    0.01 :   ffffffff817f863c:       movl   $0x110002,0x40(%rsp)
    0.00 :   ffffffff817f8644:       lea    0x48(%rsp),%rdi
   18.76 :   ffffffff817f8649:       rep stos %rax,%es:(%rdi)
    1.12 :   ffffffff817f864c:       mov    0xc(%rsp),%eax
    0.00 :   ffffffff817f8650:       mov    %ebp,0x48(%rsp)
    0.00 :   ffffffff817f8654:       mov    %eax,0x44(%rsp)
    0.00 :   ffffffff817f8658:       movzwl 0x10(%rsp),%eax
    1.21 :   ffffffff817f865d:       mov    %ax,0x60(%rsp)
    0.00 :   ffffffff817f8662:       movzwl 0x20(%rsp),%eax
    0.00 :   ffffffff817f8667:       mov    %ax,0x62(%rsp)
         :                      .sport = sport,
         :                      .dport = dport,
         :                      };
         :                      u32 act;
         :
         :                      act = BPF_PROG_SK_LOOKUP_RUN_ARRAY(run_array, ctx, BPF_PROG_RUN);
--->8---

Looking at the RX pps drop this is not something we're concerned with ATM.
The overhead will drown in cycles burned in iptables, which were
intentionally unloaded for the benchmark.

If someone has an idea how to tune it, though, I'm all ears.

Thanks,
-jkbs

[0] https://lore.kernel.org/bpf/20200506125514.1020829-1-jakub@xxxxxxxxxxxxxx/
[1] https://github.com/majek/inet-tool/blob/master/ebpf/inet-kern.c
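
To make the zero-fill in the annotate output above more concrete: in C, a
designated initializer zeroes every member it does not name, which is what
the `rep stos` is doing there (6 quadwords, per `mov $0x6,%ecx`). Below is a
minimal user-space sketch of two ways to populate such a context. Note that
struct demo_lookup_ctx and both helper names are hypothetical, made up for
illustration; this is not the kernel's struct bpf_sk_lookup_kern, and
whether skipping the zero-fill would be safe or measurably faster in the
kernel has not been tested here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a lookup context.
 * NOT the layout of the kernel's struct bpf_sk_lookup_kern. */
struct demo_lookup_ctx {
	uint16_t family;
	uint16_t protocol;
	uint32_t saddr4;
	uint32_t daddr4;
	const void *saddr6;
	const void *daddr6;
	void *selected_sk;
	uint16_t sport;
	uint16_t dport;
};

/* Variant A: designated initializer. Members that are not named
 * (saddr6, daddr6, selected_sk) are guaranteed to be zero-initialized,
 * so the compiler has to clear them, typically with a bulk store. */
static void demo_ctx_init_designated(struct demo_lookup_ctx *out,
				     uint32_t saddr, uint32_t daddr,
				     uint16_t sport, uint16_t dport)
{
	struct demo_lookup_ctx ctx = {
		.family   = 2,	/* AF_INET on Linux */
		.protocol = 17,	/* IPPROTO_UDP */
		.saddr4   = saddr,
		.daddr4   = daddr,
		.sport    = sport,
		.dport    = dport,
	};

	assert(out != NULL);
	*out = ctx;
}

/* Variant B: store only the members an IPv4 lookup would read, plus
 * selected_sk. saddr6/daddr6 are deliberately left untouched, so this
 * is only valid if nothing reads them for this family (an assumption). */
static void demo_ctx_init_fields(struct demo_lookup_ctx *out,
				 uint32_t saddr, uint32_t daddr,
				 uint16_t sport, uint16_t dport)
{
	assert(out != NULL);
	out->family = 2;
	out->protocol = 17;
	out->saddr4 = saddr;
	out->daddr4 = daddr;
	out->selected_sk = NULL;
	out->sport = sport;
	out->dport = dport;
}
```

Calling demo_ctx_init_fields() on a buffer pre-filled with 0xff leaves
saddr6/daddr6 holding the poison bytes, while demo_ctx_init_designated()
clears them. Variant B trades that guarantee for fewer stores, so it is only
an option where the unset members are provably never read.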