I'll still need some time to finish up the ARM NEON vectorised implementation, so I thought I'd start posting patches introducing support for 8-bit groups and the related adaptation of the (previously posted) AVX2-based vectorised implementation. Patches 1/5 and 2/5, as discussed with Pablo, introduce support and switching mechanisms for 8-bit packet matching groups. I opted to pick the fitting implementation with conditionals, instead of replacing the set lookup operation on the fly, as this allows for fields with different group sizes. The cost of these conditionals actually appears negligible. For the non-vectorised case, the two implementations are almost identical and mostly remain as a single function, while, at least for AVX2, operation sequences turned out to be fairly different, so the new matching functions for 8-bit groups are all separated. As a side note, I also tried out Pablo's suggestion to use the stack for scratch maps, instead of per-CPU pre-allocated ones, if bucket sizes are small enough. The outcome was rather surprising: it looks cheaper, at least on x86_64, to access pre-allocated data compared to initialise the room we need on the stack. Patches 3/5 and 4/5 are similar to what I posted earlier, and they are preparation work for vectorised implementations: we need to support arbitrary requirements about data alignment and we also need to share some helper functions. Patch 5/5 implements the AVX2 lookup routines, now supporting 4-bit and 8-bit as group sizes. The matching rate figures below were obtained with the usual kselftests cases, averaged over five runs, on a single thread of an AMD Epyc 7402 CPU for x86_64 and on a single BCM2711 thread (Raspberry Pi 4 Model B clocked at 2147MHz) for a comparison with ARM 64-bit. Note that I disabled retpolines (on x86_64) and SSBD (on aarch64), so these matching rates can't be directly compared to figures I shared previously -- hence the new baselines (also repeated in single patch messages). For some reason, I'm getting more repeatable numbers this way, and we're probably going to get rid of a number of indirect calls in the future anyway. By hardcoding calls to set lookup functions, I'm getting numbers rather close to these baselines even with CONFIG_RETPOLINE set. Also note, as it was the case earlier, that this is not a fair comparison with hash and rbtree types, because hash types don't support ranges and rbtree doesn't support multiple fields. Especially matching on a single field is significantly faster than this. Some minor adjustments are still needed to properly support matching on less than two fields, though. Once they are implemented, we could at least get a fair comparison with rbtree. ---------------.-----------------------------------.-------------------------. AMD Epyc 7402 | baselines, Mpps | this series, Mpps | 1 thread |___________________________________|_________________________| 3.35GHz | | | | | | | 768KiB L1D$ | netdev | hash | rbtree | | | | ---------------| hook | no | single | pipapo | pipapo | pipapo | type entries | drop | ranges | field | 4 bits | bit switch | AVX2 | ---------------|--------|--------|--------|--------|------------|------------| net,port | | | | | | | 1000 | 19.0 | 10.4 | 3.8 | 2.8 | 4.0 +43% | 7.5 +168% | ---------------|--------|--------|--------|--------|------------|------------| port,net | | | | | | | 100 | 18.8 | 10.3 | 5.8 | 5.5 | 6.3 +14% | 8.1 +47% | ---------------|--------|--------|--------|--------|------------|------------| net6,port | | | | | | | 1000 | 16.4 | 7.6 | 1.8 | 1.3 | 2.1 +61% | 4.8 +269% | ---------------|--------|--------|--------|--------|------------|------------| port,proto | | | | | [1] | | 30000 | 19.6 | 11.6 | 3.9 | 0.3 | 0.5 +66% | 2.6 +766% | ---------------|--------|--------|--------|--------|------------|------------| net6,port,mac | | | | | | | 10 | 16.5 | 5.4 | 4.3 | 2.6 | 3.4 +31% | 4.7 +81% | ---------------|--------|--------|--------|--------|------------|------------| net6,port,mac, | | | | | | | proto 1000 | 16.5 | 5.7 | 1.9 | 1.0 | 1.4 +40% | 3.6 +260% | ---------------|--------|--------|--------|--------|------------|------------| net,mac | | | | | | | 1000 | 19.0 | 8.4 | 3.9 | 1.7 | 2.5 +47% | 6.4 +276% | ---------------'--------'--------'--------'--------'------------'------------' [1] Causes switch of lookup table buckets for 'port' to 4-bit groups ---------------.-----------------------------------.------------. BCM2711 | baselines, Mpps | patch 2/5 | 1 thread |___________________________________|____________| 2147MHz | | | | | | 32KiB L1D$ | netdev | hash | rbtree | | | ---------------| hook | no | single | pipapo | pipapo | type entries | drop | ranges | field | 4 bits | bit switch | ---------------|--------|--------|--------|--------|------------| net,port | | | | | | 1000 | 1.63 | 1.37 | 0.87 | 0.61 | 0.70 +17% | ---------------|--------|--------|--------|--------|------------| port,net | | | | | | 100 | 1.64 | 1.36 | 1.02 | 0.78 | 0.81 +4% | ---------------|--------|--------|--------|--------|------------| net6,port | | | | | | 1000 | 1.56 | 1.27 | 0.65 | 0.34 | 0.50 +47% | ---------------|--------|--------|--------|--------|------------| port,proto [1] | | | | | | 10000 | 1.68 | 1.43 | 0.84 | 0.30 | 0.40 +13% | ---------------|--------|--------|--------|--------|------------| net6,port,mac | | | | | | 10 | 1.56 | 1.14 | 1.02 | 0.62 | 0.66 +6% | ---------------|--------|--------|--------|--------|------------| net6,port,mac, | | | | | | proto 1000 | 1.56 | 1.12 | 0.64 | 0.27 | 0.40 +48% | ---------------|--------|--------|--------|--------|------------| net,mac | | | | | | 1000 | 1.63 | 1.26 | 0.87 | 0.41 | 0.53 +29% | ---------------'--------'--------'--------'--------'------------' [1] Using 10000 entries instead of 30000 as it would take way too long for the test script to generate all of them Stefano Brivio (5): nft_set_pipapo: Generalise group size for buckets nft_set_pipapo: Add support for 8-bit lookup groups and dynamic switch nft_set_pipapo: Prepare for vectorised implementation: alignment nft_set_pipapo: Prepare for vectorised implementation: helpers nft_set_pipapo: Introduce AVX2-based lookup implementation include/net/netfilter/nf_tables_core.h | 1 + net/netfilter/Makefile | 5 + net/netfilter/nf_tables_set_core.c | 6 + net/netfilter/nft_set_pipapo.c | 614 +++++++----- net/netfilter/nft_set_pipapo.h | 277 ++++++ net/netfilter/nft_set_pipapo_avx2.c | 1222 ++++++++++++++++++++++++ net/netfilter/nft_set_pipapo_avx2.h | 14 + 7 files changed, 1881 insertions(+), 258 deletions(-) create mode 100644 net/netfilter/nft_set_pipapo.h create mode 100644 net/netfilter/nft_set_pipapo_avx2.c create mode 100644 net/netfilter/nft_set_pipapo_avx2.h -- 2.25.0