On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> A quick benchmark shows the performance is bad when using a subtree lock for lookup:
>
> Randomly-generated binary data (key size=255, max entries=16K, key length range: [1, 255])
> * no lock
> qp-trie lookup (1 thread)     10.250 ± 0.009M/s (drops 0.006 ± 0.000M/s mem 0.000 MiB)
> qp-trie lookup (2 thread)     20.466 ± 0.009M/s (drops 0.010 ± 0.000M/s mem 0.000 MiB)
> qp-trie lookup (4 thread)     41.211 ± 0.010M/s (drops 0.018 ± 0.000M/s mem 0.000 MiB)
> qp-trie lookup (8 thread)     82.933 ± 0.409M/s (drops 0.031 ± 0.000M/s mem 0.000 MiB)
> qp-trie lookup (16 thread)   162.615 ± 0.842M/s (drops 0.070 ± 0.000M/s mem 0.000 MiB)
>
> * subtree lock
> qp-trie lookup (1 thread)      8.990 ± 0.506M/s (drops 0.006 ± 0.000M/s mem 0.000 MiB)
> qp-trie lookup (2 thread)     15.908 ± 0.141M/s (drops 0.004 ± 0.000M/s mem 0.000 MiB)
> qp-trie lookup (4 thread)     27.551 ± 0.025M/s (drops 0.019 ± 0.000M/s mem 0.000 MiB)
> qp-trie lookup (8 thread)     42.040 ± 0.241M/s (drops 0.018 ± 0.000M/s mem 0.000 MiB)
> qp-trie lookup (16 thread)    50.884 ± 0.171M/s (drops 0.012 ± 0.000M/s mem 0.000 MiB)

That's indeed significant, but I interpret it differently.
Since single-thread performance is close while 16 threads take a 3x hit,
the locking mechanism itself is inefficient. It also means that
update/delete performance doesn't scale either.

> Strings in /proc/kallsyms (key size=83, max entries=170958)
> * no lock
> qp-trie lookup (1 thread)      4.096 ± 0.234M/s (drops 0.249 ± 0.014M/s mem 0.000 MiB)
>
> * subtree lock
> qp-trie lookup (1 thread)      4.454 ± 0.108M/s (drops 0.271 ± 0.007M/s mem 0.000 MiB)

Here a single thread with spin_lock is _faster_ than without.
So the problem is not the cost of spin_lock but its contention.

All the complexity of a lockless lookup has to be weighed in this context.
Update/delete don't scale anyway, so lockless-lookup complexity is justified
only for workloads with lots of concurrent lookups and few updates/deletes.

When the bpf hash map was added, the majority of tracing use cases had
# of lookups == # of updates == # of deletes.
For qp-trie we obviously cannot predict the workload, but we should not
pivot strongly toward a lockless lookup without data.
A lockless lookup should have reasonable complexity. Otherwise let's do
spin_lock and work on a different locking scheme for all operations
(lookup/update/delete); a rough sketch of both lookup schemes is at the
end of this mail.

> I cannot reproduce the phenomenon that call_rcu consumes 100% of all cpus in my
> local environment, could you share the setup for it?
>
> The following is the output of perf report (--no-children) for "./map_perf_test
> 4 72 10240 100000" on an x86-64 host with 72 cpus:
>
>   26.63%  map_perf_test  [kernel.vmlinux]  [k] alloc_htab_elem
>   21.57%  map_perf_test  [kernel.vmlinux]  [k] htab_map_update_elem

Looks like the performance is being lost on atomic_inc/dec.
Try a partial revert of mem_alloc. In particular, make sure that commit
0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
is reverted so that call_rcu is back in place, but keep the percpu counter
optimization. Also please use 'map_perf_test 4'. I doubt 1000 vs 10240
will make a difference, but still.
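
In case it helps to clarify what "call_rcu is in place" means after such a
partial revert: the shape is per-element deferred freeing, roughly as below.
This is only a sketch with made-up names (my_elem, my_elem_free), not the
real kernel/bpf/hashtab.c code:

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct my_elem {
	struct rcu_head rcu;
	/* ... key/value payload ... */
};

static void my_elem_free_rcu(struct rcu_head *head)
{
	struct my_elem *e = container_of(head, struct my_elem, rcu);

	kfree(e);
}

static void my_elem_free(struct my_elem *e)
{
	/* One call_rcu() per deleted element. With a high update/delete
	 * rate these callbacks are the overhead that the optimization
	 * removed; the revert brings them back so the profiles can be
	 * compared.
	 */
	call_rcu(&e->rcu, my_elem_free_rcu);
}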
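
And going back to the lookup numbers at the top, here is a rough sketch of
the two lookup schemes being compared, just to make the trade-off concrete.
The types and the qp_trie_walk() helper are made up for illustration; this
is not the actual qp-trie patch:

#include <linux/spinlock.h>
#include <linux/rcupdate.h>
#include <linux/types.h>

/* Hypothetical types/helpers, for illustration only. */
struct qp_trie_node;
struct qp_trie_subtree {
	spinlock_t lock;
	struct qp_trie_node __rcu *root;
};
void *qp_trie_walk(struct qp_trie_node *node, const void *key, u32 key_len);

/* Subtree lock: simple and correct, but every lookup that lands in the
 * same subtree serializes on st->lock. That is where the drop from
 * ~160M/s to ~50M/s at 16 threads comes from.
 */
static void *lookup_with_subtree_lock(struct qp_trie_subtree *st,
				      const void *key, u32 key_len)
{
	unsigned long flags;
	void *val;

	spin_lock_irqsave(&st->lock, flags);
	val = qp_trie_walk(rcu_dereference_protected(st->root,
						     lockdep_is_held(&st->lock)),
			   key, key_len);
	spin_unlock_irqrestore(&st->lock, flags);
	return val;
}

/* Lockless lookup: readers only take rcu_read_lock(); writers still
 * serialize on st->lock, publish nodes with rcu_assign_pointer() and
 * free them via call_rcu(). Readers scale, but all the ordering
 * complexity moves into the update/delete path.
 */
static void *lookup_lockless(struct qp_trie_subtree *st,
			     const void *key, u32 key_len)
{
	void *val;

	rcu_read_lock();
	val = qp_trie_walk(rcu_dereference(st->root), key, key_len);
	rcu_read_unlock();
	return val;
}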