On Sat, Oct 08, 2022 at 09:40:04AM -0700, Alexei Starovoitov wrote:
> On Sat, Oct 8, 2022 at 6:22 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >
> > On Fri, Oct 07, 2022 at 06:59:08PM -0700, Alexei Starovoitov wrote:
> > > On Fri, Oct 7, 2022 at 6:56 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On 9/29/2022 11:22 AM, Alexei Starovoitov wrote:
> > > > > On Wed, Sep 28, 2022 at 1:46 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> > > > >> Hi,
> > > > >>
> > > > >> On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
> > > > >>> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> > > > >>>
> > > > >>> Looks like the perf is lost on atomic_inc/dec.
> > > > >>> Try a partial revert of mem_alloc.
> > > > >>> In particular to make sure
> > > > >>> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
> > > > >>> is reverted and call_rcu is in place,
> > > > >>> but percpu counter optimization is still there.
> > > > >>> Also please use 'map_perf_test 4'.
> > > > >>> I doubt 1000 vs 10240 will make a difference, but still.
> > > > >>>
> > > > >> I have tried the following two setups:
> > > > >> (1) Don't use bpf_mem_alloc in hash-map and use per-cpu counter in hash-map
> > > > >> # Samples: 1M of event 'cycles:ppp'
> > > > >> # Event count (approx.): 1041345723234
> > > > >> #
> > > > >> # Overhead  Command        Shared Object     Symbol
> > > > >> # ........  .............  ................  ...............................
> > > > >> #
> > > > >>    10.36%  map_perf_test  [kernel.vmlinux]  [k] bpf_map_get_memcg.isra.0
> > > > > That is per-cpu counter and it's consuming 10% ?!
> > > > > Something is really odd in your setup.
> > > > > A lot of debug configs?
> > > > Sorry for the late reply. Just back to work from a long vacation.
> > > >
> > > > My local .config is derived from the Fedora distribution. It indeed has some DEBUG
> > > > related configs. Will turn these configs off to check it again :)
> > > > >>     9.82%  map_perf_test  [kernel.vmlinux]  [k] bpf_map_kmalloc_node
> > > > >>     4.24%  map_perf_test  [kernel.vmlinux]  [k] check_preemption_disabled
> > > > > clearly debug build.
> > > > > Please use production build.
> > > > check_preemption_disabled is due to CONFIG_DEBUG_PREEMPT. And it is enabled on
> > > > the Fedora distribution.
> > > > >>     2.86%  map_perf_test  [kernel.vmlinux]  [k] htab_map_update_elem
> > > > >>     2.80%  map_perf_test  [kernel.vmlinux]  [k] __kmalloc_node
> > > > >>     2.72%  map_perf_test  [kernel.vmlinux]  [k] htab_map_delete_elem
> > > > >>     2.30%  map_perf_test  [kernel.vmlinux]  [k] memcg_slab_post_alloc_hook
> > > > >>     2.21%  map_perf_test  [kernel.vmlinux]  [k] entry_SYSCALL_64
> > > > >>     2.17%  map_perf_test  [kernel.vmlinux]  [k] syscall_exit_to_user_mode
> > > > >>     2.12%  map_perf_test  [kernel.vmlinux]  [k] jhash
> > > > >>     2.11%  map_perf_test  [kernel.vmlinux]  [k] syscall_return_via_sysret
> > > > >>     2.05%  map_perf_test  [kernel.vmlinux]  [k] alloc_htab_elem
> > > > >>     1.94%  map_perf_test  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
> > > > >>     1.92%  map_perf_test  [kernel.vmlinux]  [k] preempt_count_add
> > > > >>     1.92%  map_perf_test  [kernel.vmlinux]  [k] preempt_count_sub
> > > > >>     1.87%  map_perf_test  [kernel.vmlinux]  [k] call_rcu
> > > >
> > > > SNIP
> > > >
> > > > >> Maybe adding a not-immediate-reuse flag to bpf_mem_alloc is reasonable.
> > > > >> What do you think?
> > > > > We've discussed it twice already. It's not an option due to OOM
> > > > > and performance considerations.
> > > > > call_rcu doesn't scale to millions a second.
> > > > Understood. I was just trying to understand the exact performance overhead of
> > > > call_rcu(). If the overhead of map operations is much greater than the overhead
> > > > of call_rcu(), I think calling call_rcu() one million times a second will not be a
> > > > problem, and it also makes the implementation of qp-trie much simpler. The
> > > > OOM problem is indeed a problem, although it is also possible with the current
> > > > implementation, so I will try to implement a lookup procedure which handles
> > > > the reuse problem.
> > >
> > > call_rcu is not just that particular function.
> > > It's all the work the rcu subsystem needs to do to observe a gp
> > > and execute that callback. Just see how many kthreads it will
> > > start when overloaded like this.
> >
> > The kthreads to watch include rcu_preempt, rcu_sched, ksoftirqd*, rcuc*,
> > and rcuo*.  There is also the back-of-interrupt softirq context, which
> > requires some care to measure accurately.
> >
> > The possibility of SLAB_TYPESAFE_BY_RCU has been discussed.  I take it
> > that the per-element locking overhead for exact iterations was a problem?
> > If so, what exactly are the consistency rules for iteration?  Presumably
> > stronger than "if the element existed throughout, it is included in the
> > iteration; if it did not exist throughout, it is not included; otherwise
> > it might or might not be included" given that you get that for free.
> >
> > Either way, could you please tell me the exact iteration rules?
>
> The rules are the way we make them to be.
> iteration will be under lock.
> lookup needs to be correct. It can retry if necessary (like htab is doing).
> Randomly returning 'noexist' is, of course, not acceptable.

OK, so then it is important that updates to this data structure be
carried out in such a way as to avoid discombobulating lockless readers.
Do the updates have that property?

The usual way to get that property is to leave the old search structure
around, replacing it with the new one, and RCU-freeing the old one.

In case it helps, Kung and Lehman describe how to do that for search
trees:

http://www.eecs.harvard.edu/~htk/publication/1980-tods-kung-lehman.pdf

							Thanx, Paul
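
[Editor's illustration] To make the replace-then-RCU-free pattern mentioned
above concrete, here is a minimal kernel-style C sketch under stated
assumptions: readers pick up the current search structure inside an RCU
read-side critical section, while an updater builds a replacement off to the
side, publishes it, and frees the old one only after a grace period.  The
names struct foo_table, foo_root, foo_current_table, and foo_replace_table
are purely hypothetical and are not part of the code under discussion.

	#include <linux/rcupdate.h>
	#include <linux/slab.h>
	#include <linux/spinlock.h>

	/* Hypothetical search structure; the payload details don't matter here. */
	struct foo_table {
		struct rcu_head rcu;
		/* ... keys, nodes, etc. ... */
	};

	static struct foo_table __rcu *foo_root;
	static DEFINE_SPINLOCK(foo_update_lock);

	/* Lockless reader: caller holds rcu_read_lock(); it sees either the
	 * old or the new table, never freed memory. */
	static struct foo_table *foo_current_table(void)
	{
		return rcu_dereference(foo_root);
	}

	/* Updater: build a new table elsewhere, then publish it and
	 * RCU-free the old one. */
	static void foo_replace_table(struct foo_table *new_tbl)
	{
		struct foo_table *old_tbl;

		spin_lock(&foo_update_lock);
		old_tbl = rcu_dereference_protected(foo_root,
					lockdep_is_held(&foo_update_lock));
		rcu_assign_pointer(foo_root, new_tbl);
		spin_unlock(&foo_update_lock);

		/* Deferred free: waits for pre-existing readers to finish. */
		if (old_tbl)
			kfree_rcu(old_tbl, rcu);
	}

The point of the pattern is that a reader that started before the
rcu_assign_pointer() may still be walking the old table, but kfree_rcu()
guarantees the memory is not reclaimed until that reader has left its RCU
read-side critical section, so lockless lookups never observe a
half-updated or freed structure.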