Hi Florian,

On Thu, 13 May 2021 22:29:54 +0200
Florian Westphal <fw@xxxxxxxxx> wrote:

> This adds a nft_set_do_lookup() helper, then extends it to use
> direct calls when RETPOLINE feature is enabled.
>
> For non-retpoline builds, nft_set_do_lookup() inline helper
> does a indirect call.  INDIRECT_CALLABLE_SCOPE macro allows to
> keep the lookup functions static in this case.

Thanks for doing this! And sorry I looked into it more than one year
ago without ever finishing it ;)

I ran some quick tests, I was curious to see the impact of dropping
indirect calls on that path. With the 'performance' test cases of
nft_concat_range.sh, roughly estimating clock cycles as clock frequency
divided by packet rate, it looks like this offsets entirely the usage
of retpolines!

With a 'return true;' in the lookup function (I patched
nft_set_pipapo), on my usual single AMD Epyc 7351 thread, 2.9GHz,
average of three runs, I get:

                                               | packet |  est.  |
                                               |  rate  | cycles |
                                               | (Mpps) |        |
-----------------------------------------------|--------|--------|
Without retpolines, netdev drop                | 15.443 |   188  |
Without retpolines, dummy lookup function      |  9.995 |   292  |
 -> Without retpolines, set lookup             |        |   104 -|-.
- - - - - - - - - - - - - - - - - - - - - - - -|- - - - | - - - -| |
With retpolines, netdev drop                   | 10.420 |   278  | |
With retpolines, dummy lookup function         |  7.038 |   412  | |
 -> With retpolines, set lookup                |        |   134  | |
- - - - - - - - - - - - - - - - - - - - - - - -|- - - - | - - - -| |
This series, retpolines, netdev drop           | 10.431 |   278  | |
This series, retpolines, dummy lookup function |  7.549 |   384  | |
 -> This series, retpolines, set lookup        |  ^ +7% |   106 -|-'

Estimated clock cycles for set lookup only are the difference between
cycles to hit the dummy lookup function and cycles to drop packets
from the netdev hook -- they're now approximately the same with and
without retpolines.

For context, I also ran the whole set of tests with actual matching.
This is indicative, just a single run:

--------------.-----------------------------------.--------------------------.
AMD Epyc 7351 |          baselines, Mpps          |       this series        |
 1 thread     |___________________________________|__________________________|
 2.9GHz       |        |        |        |        |        |        |        |
 512KiB L1D$  | netdev |  hash  | rbtree |        |  hash  | rbtree |        |
--------------|  hook  |   no   | single |        |   no   | single |        |
type  entries |  drop  | ranges | field  | pipapo | ranges | field  | pipapo |
--------------|--------|--------|--------|--------|--------|--------|--------|
net,port      |        |        |        |        |  +15%  |   +4%  |   +4%  |
 1000         |  10.1  |   5.2  |   2.7  |   4.6  |   6.0  |   2.8  |   4.8  |
--------------|--------|--------|--------|--------|--------|--------|--------|
port,net      |        |        |        |        |  +11%  |   +5%  |   +4%  |
 100          |  10.4  |   5.4  |   4.1  |   5.0  |   6.0  |   4.3  |   5.2  |
--------------|--------|--------|--------|--------|--------|--------|--------|
net6,port     |        |        |        |        |  +15%  |   +9%  |   +6%  |
 1000         |  10.0  |   4.6  |   1.1  |   3.1  |   9.9  |   1.2  |   3.3  |
--------------|--------|--------|--------|--------|--------|--------|--------|
port,proto    |        |        |        |        |   +7%  |   +3%  |   +3%  |
 10000        |  10.7  |   6.0  |   3.0  |   3.0  |   6.4  |   3.1  |   3.1  |
--------------|--------|--------|--------|--------|--------|--------|--------|
net6,port,mac |        |        |        |        |   +3%  |   +4%  |   +3%  |
 10           |   9.9  |   3.8  |   2.7  |   3.3  |   3.9  |   2.8  |   3.4  |
--------------|--------|--------|--------|--------|--------|--------|--------|
net6,port,mac,|        |        |        |        |   +3%  |   +9%  |   +4%  |
 proto 1000   |  10.0  |   3.6  |   1.1  |   2.4  |   3.7  |   1.2  |   2.5  |
--------------|--------|--------|--------|--------|--------|--------|--------|
net,mac       |        |        |        |        |   +6%  |   +4%  |   +3%  |
 1000         |  10.5  |   4.8  |   2.7  |   4.0  |   5.1  |   2.8  |   4.1  |
--------------'--------'--------'--------'--------'--------'--------'--------'

-- 
Stefano
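
For reference, the per-lookup cycle estimates in the first table can be approximated with a few lines of arithmetic. This is only a sketch of the estimation described above (clock frequency divided by packet rate, then dummy lookup minus netdev drop). The function names are made up for illustration, and 2900 MHz is taken as a round figure for the 2.9GHz clock. The results can differ from the table by a cycle or two, presumably because of how the three runs were averaged (an assumption, the message does not say).

```python
# Rough estimate of clock cycles per packet and per set lookup.
# Rates (Mpps) are copied from the first table; est_cycles() and
# lookup_cycles() are illustrative names, not part of any tool.

CLOCK_MHZ = 2900.0  # assumed round value for the 2.9GHz thread


def est_cycles(mpps, clock_mhz=CLOCK_MHZ):
    """Estimated cycles per packet: clock frequency / packet rate."""
    return clock_mhz / mpps


def lookup_cycles(drop_mpps, dummy_mpps, clock_mhz=CLOCK_MHZ):
    """Cycles attributed to the set lookup alone: cycles to reach the
    dummy lookup function minus cycles to drop from the netdev hook."""
    return est_cycles(dummy_mpps, clock_mhz) - est_cycles(drop_mpps, clock_mhz)


# (netdev drop rate, dummy lookup rate) in Mpps, from the first table
RATES = {
    "without retpolines": (15.443, 9.995),
    "with retpolines": (10.420, 7.038),
    "this series, retpolines": (10.431, 7.549),
}

if __name__ == "__main__":
    for name, (drop, dummy) in RATES.items():
        cycles = lookup_cycles(drop, dummy)
        print(f"{name}: ~{cycles:.0f} cycles per set lookup")
```

The subtraction is what isolates the lookup cost: both measurements include the fixed cost of getting a packet to the netdev hook, so that part cancels out.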