Hi,

On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>> A quick benchmark show the performance is bad when using subtree lock for lookup:
>>
>> Randomly-generated binary data (key size=255, max entries=16K, key length range:[1, 255])
>> * no lock
>> qp-trie lookup (1  thread)   10.250 ± 0.009M/s (drops 0.006 ± 0.000M/s mem 0.000 MiB)
>> qp-trie lookup (2  thread)   20.466 ± 0.009M/s (drops 0.010 ± 0.000M/s mem 0.000 MiB)
>> qp-trie lookup (4  thread)   41.211 ± 0.010M/s (drops 0.018 ± 0.000M/s mem 0.000 MiB)
>> qp-trie lookup (8  thread)   82.933 ± 0.409M/s (drops 0.031 ± 0.000M/s mem 0.000 MiB)
>> qp-trie lookup (16 thread)  162.615 ± 0.842M/s (drops 0.070 ± 0.000M/s mem 0.000 MiB)
>>
>> * subtree lock
>> qp-trie lookup (1  thread)    8.990 ± 0.506M/s (drops 0.006 ± 0.000M/s mem 0.000 MiB)
>> qp-trie lookup (2  thread)   15.908 ± 0.141M/s (drops 0.004 ± 0.000M/s mem 0.000 MiB)
>> qp-trie lookup (4  thread)   27.551 ± 0.025M/s (drops 0.019 ± 0.000M/s mem 0.000 MiB)
>> qp-trie lookup (8  thread)   42.040 ± 0.241M/s (drops 0.018 ± 0.000M/s mem 0.000 MiB)
>> qp-trie lookup (16 thread)   50.884 ± 0.171M/s (drops 0.012 ± 0.000M/s mem 0.000 MiB)
> That's indeed significant.
> But I interpret it differently.
> Since single thread perf is close enough while 16 thread
> suffers 3x it means the lock mechanism is inefficient.
> It means update/delete performance equally doesn't scale.
>
>> Strings in /proc/kallsyms (key size=83, max entries=170958)
>> * no lock
>> qp-trie lookup (1 thread)    4.096 ± 0.234M/s (drops 0.249 ± 0.014M/s mem 0.000 MiB)
>>
>> * subtree lock
>> qp-trie lookup (1 thread)    4.454 ± 0.108M/s (drops 0.271 ± 0.007M/s mem 0.000 MiB)
> Here a single thread with spin_lock is _faster_ than without.

That is a little strange, because there is no extra overhead in the lockless lookup at
the moment. It may be due to CPU frequency boost or something similar. I will check it
again.

> So it's not about the cost of spin_lock, but its contention.
> So all the complexity to do lockless lookup
> needs to be considered in this context.
> Looks like update/delete don't scale anyway.

Yes.

> So lock-less lookup complexity is justified only
> for the case with a lot of concurrent lookups and
> little update/delete.
> When bpf hash map was added the majority of tracing use cases
> had # of lookups == # of updates == # of deletes.
> For qp-trie we obviously cannot predict,
> but should not pivot strongly into lock-less lookup without data.
> The lock-less lookup should have reasonable complexity.
> Otherwise let's do spin_lock and work on a different locking scheme
> for all operations (lookup/update/delete).

For lpm-trie, doesn't the use case in Cilium [0] have many lockless lookups and few
updates/deletions? I think that use case is applicable to qp-trie as well. Before
steering towards spin_lock, let me implement a demo of the lockless lookup first to see
how complex it is. As for the spin_lock approach, I had already tried hand-over-hand
locking for update/delete (the lookup was still lockless), but it did not scale well
either.
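To make it concrete, hand-over-hand (lock coupling) descent looks roughly like the
simplified sketch below. This is only a generic illustration with a hypothetical,
oversimplified node layout (256-way fan-out, one pthread spinlock per node), not the
actual qp-trie patch; the real trie uses nibble-indexed, bitmap-compressed branches:

#include <pthread.h>
#include <stddef.h>

/* Hypothetical, oversimplified node: one lock per node, 256-way fan-out. */
struct demo_node {
	pthread_spinlock_t lock;
	int is_leaf;
	unsigned int index;		/* which key byte this branch tests */
	struct demo_node *child[256];
};

/* Descend with lock coupling: take the child's lock before dropping the
 * parent's, so writers working in other subtrees are not blocked.  The node
 * is returned still locked; the caller (update/delete) modifies it and then
 * unlocks.
 */
static struct demo_node *descend_locked(struct demo_node *root,
					const unsigned char *key, size_t len)
{
	struct demo_node *cur = root, *next;

	pthread_spin_lock(&cur->lock);
	while (!cur->is_leaf) {
		unsigned char byte = cur->index < len ? key[cur->index] : 0;

		next = cur->child[byte];
		if (!next)
			break;
		/* hand-over-hand step */
		pthread_spin_lock(&next->lock);
		pthread_spin_unlock(&cur->lock);
		cur = next;
	}
	return cur;
}

Note that even with coupling, every writer still has to take the root's lock for the
first step, which is probably part of why it does not scale.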
>
>> I can not reproduce the phenomenon that call_rcu consumes 100% of all cpus in my
>> local environment, could you share the setup for it ?
>>
>> The following is the output of perf report (--no-children) for "./map_perf_test
>> 4 72 10240 100000" on a x86-64 host with 72-cpus:
>>
>>     26.63%  map_perf_test  [kernel.vmlinux]  [k] alloc_htab_elem
>>     21.57%  map_perf_test  [kernel.vmlinux]  [k] htab_map_update_elem
> Looks like the perf is lost on atomic_inc/dec.
> Try a partial revert of mem_alloc.
> In particular to make sure
> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
> is reverted and call_rcu is in place,
> but percpu counter optimization is still there.
> Also please use 'map_perf_test 4'.
> I doubt 1000 vs 10240 will make a difference, but still.

Will do. The suggestion is reasonable. But when max_entries=1000, use_percpu_counter
will be false.
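For reference, my understanding of the use_percpu_counter heuristic is roughly the
shape below. This is paraphrased from memory rather than quoted from
kernel/bpf/hashtab.c, so the exact constant and which cpu count it uses may differ:

#include <stdbool.h>

/* Assumed default batch size of the percpu counter. */
#define PERCPU_COUNTER_BATCH	32

/* The per-cpu element counter is only worthwhile when the map is large
 * compared to nr_cpus * batch; otherwise a plain atomic count is used.
 */
static bool should_use_percpu_counter(unsigned int max_entries,
				      unsigned int nr_online_cpus)
{
	return max_entries / 2 > nr_online_cpus * PERCPU_COUNTER_BATCH;
}

With a check of that shape, the threshold on a 72-cpu host would be about
2 * 72 * 32 = 4608 entries, so max_entries=10240 crosses it while max_entries=1000
does not.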