Patches 1/6 and 2/6, as discussed with Pablo, introduce support and
switching mechanisms for 8-bit packet matching groups. I opted to pick
the fitting implementation with conditionals, instead of replacing the
set lookup operation on the fly, as this allows for fields with
different group sizes. The cost of these conditionals appears to be
negligible.

For the non-vectorised case, the two implementations are almost
identical and mostly remain as a single function, while, at least for
AVX2, operation sequences turned out to be fairly different, so the
new matching functions for 8-bit groups are all separated.

As a side note, I also tried out Pablo's suggestion to use the stack
for scratch maps, instead of per-CPU pre-allocated ones, if bucket
sizes are small enough. The outcome was rather surprising: it looks
cheaper, at least on x86_64, to access pre-allocated data than to
initialise the room we need on the stack.

Patches 3/6 and 4/6 are similar to what I posted earlier, and they are
preparation work for vectorised implementations: we need to support
arbitrary requirements about data alignment and we also need to share
some helper functions.

Patch 5/6 implements the AVX2 lookup routines, now supporting 4-bit
and 8-bit as group sizes.

Patch 6/6 adjusts the set implementation to also support sets with a
single, ranged field. This can now be conveniently enabled with a
define, and allows us to have a fair comparison with the rbtree set
back-end.

The matching rate figures below were obtained with the usual
kselftests cases, averaged over five runs, on a single thread of an
AMD Epyc 7402 CPU for x86_64 and on a single BCM2711 thread (Raspberry
Pi 4 Model B clocked at 2147MHz) for a comparison with 64-bit ARM.

Note that I disabled retpolines (on x86_64) and SSBD (on aarch64), so
these matching rates can't be directly compared to figures I shared
previously -- hence the new baselines (also repeated in single patch
messages).
For some reason, I'm getting more repeatable numbers this way, and
we're probably going to get rid of a number of indirect calls in the
future anyway. By hardcoding calls to set lookup functions, I'm
getting numbers rather close to these baselines even with
CONFIG_RETPOLINE set.

Also note, as was the case earlier, that this is not a fair comparison
with hash types, because hash types don't support ranges.

Matching rates for concatenated ranges:

---------------.-----------------------------------.-------------------------.
AMD Epyc 7402  |          baselines, Mpps          |    this series, Mpps    |
 1 thread      |___________________________________|_________________________|
 3.35GHz       |        |        |        |        |            |            |
 768KiB L1D$   | netdev |  hash  | rbtree |        |            |            |
---------------|  hook  |   no   | single | pipapo |   pipapo   |   pipapo   |
type   entries |  drop  | ranges | field  | 4 bits | bit switch |    AVX2    |
---------------|--------|--------|--------|--------|------------|------------|
 net,port      |        |        |        |        |            |            |
          1000 |   19.0 |   10.4 |    3.8 |    2.8 |  4.0  +43% |  7.5 +168% |
---------------|--------|--------|--------|--------|------------|------------|
 port,net      |        |        |        |        |            |            |
           100 |   18.8 |   10.3 |    5.8 |    5.5 |  6.3  +14% |  8.1  +47% |
---------------|--------|--------|--------|--------|------------|------------|
 net6,port     |        |        |        |        |            |            |
          1000 |   16.4 |    7.6 |    1.8 |    1.3 |  2.1  +61% |  4.8 +269% |
---------------|--------|--------|--------|--------|------------|------------|
 port,proto    |        |        |        |        |     [1]    |     [1]    |
         30000 |   19.6 |   11.6 |    3.9 |    0.3 |  0.5  +66% |  2.6 +766% |
---------------|--------|--------|--------|--------|------------|------------|
 net6,port,mac |        |        |        |        |            |            |
            10 |   16.5 |    5.4 |    4.3 |    2.6 |  3.4  +31% |  4.7  +81% |
---------------|--------|--------|--------|--------|------------|------------|
 net6,port,mac,|        |        |        |        |            |            |
 proto    1000 |   16.5 |    5.7 |    1.9 |    1.0 |  1.4  +40% |  3.6 +260% |
---------------|--------|--------|--------|--------|------------|------------|
 net,mac       |        |        |        |        |            |            |
          1000 |   19.0 |    8.4 |    3.9 |    1.7 |  2.5  +47% |  6.4 +276% |
---------------'--------'--------'--------'--------'------------'------------'
[1] Causes switch of lookup table buckets for 'port' to 4-bit groups

---------------.-----------------------------------.------------.
BCM2711        |          baselines, Mpps          |  patch 2/6 |
 1 thread      |___________________________________|____________|
 2147MHz       |        |        |        |        |            |
 32KiB L1D$    | netdev |  hash  | rbtree |        |            |
---------------|  hook  |   no   | single | pipapo |   pipapo   |
type   entries |  drop  | ranges | field  | 4 bits | bit switch |
---------------|--------|--------|--------|--------|------------|
 net,port      |        |        |        |        |            |
          1000 |   1.63 |   1.37 |   0.87 |   0.61 | 0.70  +17% |
---------------|--------|--------|--------|--------|------------|
 port,net      |        |        |        |        |            |
           100 |   1.64 |   1.36 |   1.02 |   0.78 | 0.81   +4% |
---------------|--------|--------|--------|--------|------------|
 net6,port     |        |        |        |        |            |
          1000 |   1.56 |   1.27 |   0.65 |   0.34 | 0.50  +47% |
---------------|--------|--------|--------|--------|------------|
 port,proto [1]|        |        |        |        |            |
         10000 |   1.68 |   1.43 |   0.84 |   0.30 | 0.40  +13% |
---------------|--------|--------|--------|--------|------------|
 net6,port,mac |        |        |        |        |            |
            10 |   1.56 |   1.14 |   1.02 |   0.62 | 0.66   +6% |
---------------|--------|--------|--------|--------|------------|
 net6,port,mac,|        |        |        |        |            |
 proto    1000 |   1.56 |   1.12 |   0.64 |   0.27 | 0.40  +48% |
---------------|--------|--------|--------|--------|------------|
 net,mac       |        |        |        |        |            |
          1000 |   1.63 |   1.26 |   0.87 |   0.41 | 0.53  +29% |
---------------'--------'--------'--------'--------'------------'
[1] Using 10000 entries instead of 30000 as it would take way too long
    for the test script to generate all of them

Matching rates for non-concatenated ranges (first field):

---------------.--------------------------.-------------------------.
AMD Epyc 7402  |      baselines, Mpps     |   Mpps, % over rbtree   |
 1 thread      |__________________________|_________________________|
 3.35GHz       |        |        |        |            |            |
 768KiB L1D$   | netdev |  hash  | rbtree |            |   pipapo   |
---------------|  hook  |   no   | single |   pipapo   |single field|
type   entries |  drop  | ranges | field  |single field|    AVX2    |
---------------|--------|--------|--------|------------|------------|
 net,port      |        |        |        |            |            |
          1000 |   19.0 |   10.4 |    3.8 |  6.0  +58% |  9.6 +153% |
---------------|--------|--------|--------|------------|------------|
 port,net      |        |        |        |            |            |
           100 |   18.8 |   10.3 |    5.8 |  9.1  +57% | 11.6 +100% |
---------------|--------|--------|--------|------------|------------|
 net6,port     |        |        |        |            |            |
          1000 |   16.4 |    7.6 |    1.8 |  2.8  +55% |  6.5 +261% |
---------------|--------|--------|--------|------------|------------|
 port,proto    |        |        |        |     [1]    |     [1]    |
         30000 |   19.6 |   11.6 |    3.9 |  0.9  -77% |  2.7  -31% |
---------------|--------|--------|--------|------------|------------|
 port,proto    |        |        |        |            |            |
         10000 |   19.6 |   11.6 |    4.4 |  2.1  -52% |  5.6  +27% |
---------------|--------|--------|--------|------------|------------|
 port,proto    |        |        |        |            |            |
4 threads 10000|   77.9 |   45.1 |   17.4 |  8.3  -52% | 22.4  +29% |
---------------|--------|--------|--------|------------|------------|
 net6,port,mac |        |        |        |            |            |
            10 |   16.5 |    5.4 |    4.3 |  4.5   +5% |  8.2  +91% |
---------------|--------|--------|--------|------------|------------|
 net6,port,mac,|        |        |        |            |            |
 proto    1000 |   16.5 |    5.7 |    1.9 |  2.8  +47% |  6.6 +247% |
---------------|--------|--------|--------|------------|------------|
 net,mac       |        |        |        |            |            |
          1000 |   19.0 |    8.4 |    3.9 |  6.0  +54% |  9.9 +154% |
---------------'--------'--------'--------'------------'------------'
[1] Causes switch of lookup table buckets for 'port' to 4-bit groups

v2: Rebase, especially as series "netfilter: nf_tables: make sets
    built-in" was merged, add 6/6 as new patch and single-field
    comparison with rbtree.
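
For readers less familiar with the group size switch from patches 1/6
and 2/6, the following is a minimal sketch of how a conditional can
select between 4-bit and 8-bit groups per field. It is not code from
the series: struct field_sketch, the sketch_*() helpers and the fixed
64-rule bitmap per bucket are all hypothetical simplifications, and
only exact values (no ranges) are handled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_BUCKETS 256	/* enough for 8-bit groups */
#define SKETCH_MAX_GROUPS  4	/* 16-bit field split in 4-bit groups */

/* Hypothetical, simplified per-field lookup table: one 64-bit rule
 * bitmap per bucket, so at most 64 rules. Not the layout used by the
 * patches.
 */
struct field_sketch {
	unsigned int bb;	/* bits per group: 4 or 8 */
	unsigned int groups;	/* field width in bits / bb */
	uint64_t lt[SKETCH_MAX_GROUPS][SKETCH_MAX_BUCKETS];
};

/* Bucket selected by group 'g' of the packet bytes. */
static unsigned int sketch_bucket(const struct field_sketch *f,
				  const uint8_t *pkt, unsigned int g)
{
	if (f->bb == 8)
		return pkt[g];

	/* 4-bit groups: two per byte, high nibble first */
	return (g & 1) ? (pkt[g / 2] & 0x0f) : (pkt[g / 2] >> 4);
}

/* Initialise an empty table for a field of 'bytes' bytes. */
static void sketch_init(struct field_sketch *f, unsigned int bb,
			unsigned int bytes)
{
	assert(bb == 4 || bb == 8);
	assert(bytes * 8 / bb <= SKETCH_MAX_GROUPS);

	memset(f, 0, sizeof(*f));
	f->bb = bb;
	f->groups = bytes * 8 / bb;
}

/* Make rule number 'rule' match exactly the value in 'key'. */
static void sketch_add_exact(struct field_sketch *f, const uint8_t *key,
			     unsigned int rule)
{
	unsigned int g;

	assert(rule < 64);
	for (g = 0; g < f->groups; g++)
		f->lt[g][sketch_bucket(f, key, g)] |= 1ULL << rule;
}

/* AND the buckets selected by each group: set bits are matching rules. */
static uint64_t sketch_match(const struct field_sketch *f,
			     const uint8_t *pkt)
{
	uint64_t map = ~0ULL;
	unsigned int g;

	for (g = 0; g < f->groups; g++)
		map &= f->lt[g][sketch_bucket(f, pkt, g)];

	return map;
}
```

For a 16-bit port, sketch_init(&f, 4, 2) gives four groups of 16
buckets and sketch_init(&f, 8, 2) gives two groups of 256 buckets.
Both return the same rule bitmap for the same packet: the 8-bit
variant does half the bucket lookups, at the cost of larger tables.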

Stefano Brivio (6):
  nft_set_pipapo: Generalise group size for buckets
  nft_set_pipapo: Add support for 8-bit lookup groups and dynamic switch
  nft_set_pipapo: Prepare for vectorised implementation: alignment
  nft_set_pipapo: Prepare for vectorised implementation: helpers
  nft_set_pipapo: Introduce AVX2-based lookup implementation
  nft_set_pipapo: Prepare for single ranged field usage

 include/net/netfilter/nf_tables_core.h |    1 +
 net/netfilter/Makefile                 |    6 +
 net/netfilter/nf_tables_api.c          |    3 +
 net/netfilter/nft_set_pipapo.c         |  630 +++++++-----
 net/netfilter/nft_set_pipapo.h         |  280 ++++++
 net/netfilter/nft_set_pipapo_avx2.c    | 1223 ++++++++++++++++++++++++
 net/netfilter/nft_set_pipapo_avx2.h    |   14 +
 7 files changed, 1893 insertions(+), 264 deletions(-)
 create mode 100644 net/netfilter/nft_set_pipapo.h
 create mode 100644 net/netfilter/nft_set_pipapo_avx2.c
 create mode 100644 net/netfilter/nft_set_pipapo_avx2.h

-- 
2.25.1