On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> > As said in the commit message, the command line for the test is
> > "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> > using the default max_entries and the number of CPUs is greater than 15,
> > use_percpu_counter will be false.
>
> Right. percpu or not depends on the number of cpus.
>
> >
> > I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
> > test. For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
> > there is obvious performance degradation.
> ...
> > [root@hello bpf]# ./map_perf_test 4 8 16384
> > 2:hash_map_perf kmalloc 359201 events per sec
> ..
> > [root@hello bpf]# ./map_perf_test 4 8 16384
> > 4:hash_map_perf kmalloc 203983 events per sec
>
> this is indeed a degradation in a VM.
>
> > I also ran map_perf_test on a physical x86-64 host with 72 CPUs. The
> > performance for "./map_perf_test 4 8" is similar, but there is obvious
> > performance degradation for "./map_perf_test 4 8 16384"
>
> but... a degradation?
>
> > Before reuse-after-rcu-gp:
> >
> > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > 1:hash_map_perf kmalloc 388088 events per sec
...
> > After reuse-after-rcu-gp:
> > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > 5:hash_map_perf kmalloc 655628 events per sec
>
> This is a big improvement :) Not a degradation.
> You always have to double check the numbers with perf report.
>
> > So could you please double check your setup and rerun map_perf_test? If
> > there is no performance degradation, could you please share your setup
> > and your kernel config file?
>
> I'm testing on a normal no-debug kernel. No kasan. No lockdep. HZ=1000
> Playing with it a bit more I found something interesting:
> map_perf_test 4 8 16384
> before/after has too much noise to be conclusive.
>
> So I did
> map_perf_test 4 8 16384 1000000
>
> and now I see a significant degradation from patch 3.
> It drops from 800k to 200k.
> And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
> The following hack addresses most of the perf degradation:
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index fea1cb0c78bb..eeadc9359097 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
>         alloc = 0;
>         head = NULL;
>         tail = NULL;
> -       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
> +       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
>         while (alloc < cnt) {
>                 obj = __llist_del_first(&sc->reuse_ready_head);
>                 if (obj) {
> @@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
>                 alloc++;
>         }
>         raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> +       }
>
>         if (alloc) {
>                 if (IS_ENABLED(CONFIG_PREEMPT_RT))
> @@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
>                 sc->reuse_ready_tail = NULL;
>                 WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
>                 __llist_add_batch(head, tail, &sc->wait_for_free);
> +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>                 call_rcu_tasks_trace(&sc->rcu, free_rcu);
> +       } else {
> +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>         }
> -       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> }
>
> It now drops from 800k to 450k.
> And perf report shows that both reuse is happening and slab is working hard
> to satisfy kmalloc/kfree.
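
Since the diff above only shows fragments of the patch under discussion, here is a rough
standalone sketch of the same trylock idea: drain the shared reuse list only when its lock
is uncontended and let the caller fall back to slab otherwise. The struct and helper names
are made up for illustration; only the locking and llist primitives are real kernel APIs.

/* Sketch only, not the actual patch code. */
#include <linux/llist.h>
#include <linux/spinlock.h>

struct shared_reuse_cache {
	raw_spinlock_t reuse_lock;
	struct llist_head reuse_ready_head;
};

/* Grab up to @cnt reusable objects; returns how many were taken. */
static int drain_reusable(struct shared_reuse_cache *sc,
			  struct llist_node **batch, int cnt)
{
	unsigned long flags;
	int taken = 0;

	/* Don't spin on a contended lock; returning 0 lets the caller
	 * allocate fresh objects from slab instead.
	 */
	if (!raw_spin_trylock_irqsave(&sc->reuse_lock, flags))
		return 0;

	while (taken < cnt) {
		struct llist_node *obj;

		obj = __llist_del_first(&sc->reuse_ready_head);
		if (!obj)
			break;
		batch[taken++] = obj;
	}
	raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);

	return taken;
}
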
> So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma waiting_for_rcu_task_trace_gp lists.

Sorry, per-cpu waiting_for_rcu_gp is what patch 3 does already.
I meant per-cpu reuse_ready and per-bpf-ma waiting_for_rcu_task_trace_gp
(a rough sketch of that layout is at the end of this mail).

Also noticed that the overhead of the shared reuse_ready list comes both
from the contended lock and from cache misses when one cpu pushes to the
list after the RCU GP and another cpu removes from it.

Also low/batch/high watermarks are all wrong in patch 3.
low=32 and high=96 make no sense when it's not a single list.
I'm experimenting with 32 for all three heuristics.

Another thing I noticed is that per-cpu prepare_reuse and free_by_rcu are
redundant. unit_free() can push into free_by_rcu directly, then
reuse_bulk() can fill it up with free_llist_extra and move them into
waiting_for_gp.

All these _tail optimizations are obscuring the code and making it hard
to notice these issues.

> For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
>
> Answering your other email:
>
> > I see your point. I will continue to debug the memory usage difference
> > between v3 and v4.
>
> imo it's a waste of time to continue analyzing performance based on the bench in patch 2.
>
> > I don't think so. Let's consider the per-cpu list first. Assume the
> > normal RCU grace period is about 30ms and we are tracing the IO latency
> > of a normal SSD. The iops is about 176K per second, so before one RCU
> > GP has passed, we will need to allocate about 176 * 30 = 5.2K elements.
> > For the per-ma list, when the number of CPUs increases, it is easy to
> > make the list contain thousands of elements.
>
> That would be true only if there were no scheduling events in all of the 176K ops.
> Which is not the case.
> I'm not sure why you're saying that the RCU GP is 30ms.
> In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
> Every sched event is a sort-of implicit rcu_read_lock/unlock.
> Network and block IO don't process 176K packets without a resched.
> Don't know how block does it, but in networking NAPI will process 64 packets and will yield softirq.
>
> For small size buckets low_watermark=32 and high=96.
> We typically move 32 elements at a time from one list to another.
> A bunch of elements may be sitting in free_by_rcu, and moving them to waiting_for_gp
> is not instant, but once __free_rcu_tasks_trace is called we need to take
> elements from waiting_for_gp one at a time and kfree them one at a time.
> So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
>
> > Before I post v5, I want to know the reason why the per-bpf-ma list is
> > introduced. Previously, I thought it was used to handle the case in which
> > allocation and freeing are done on different CPUs.
>
> Correct. The per-bpf-ma list is necessary to avoid OOM-ing due to the slow rcu_tasks_trace GP.
>
> > And as we can see
> > from the benchmark result above and in v3, the performance and the
> > memory usage of v4 for add_del_on_diff_cpu are better than in v3.
>
> The bench from patch 2 is invalid. Hence no conclusion can be made.
> So far the only bench we can trust and analyze is map_perf_test.
> Please make the bench in patch 2 yield the cpu after a few updates.
> Earlier I suggested sticking to 10, but since NAPI can do 64 at a time,
> 64 updates is realistic too. A thousand is not.
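
As referenced above, here is a rough sketch of the per-cpu reuse_ready plus
per-bpf-ma waiting-for-tasks-trace-GP layout being discussed. Struct and
field names are illustrative only, not actual v5 code.

#include <linux/llist.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>
#include <linux/types.h>

/* Per-cpu part: no prepare_reuse, no _tail tracking. unit_free()
 * pushes straight into free_by_rcu; after a regular RCU GP the
 * elements move to reuse_ready and can be handed out again.
 */
struct bpf_mem_cache_sketch {
	struct llist_head free_by_rcu;   /* freed, waiting for regular RCU GP */
	struct llist_head reuse_ready;   /* GP elapsed, safe to reuse on this cpu */
	struct rcu_head rcu;
};

/* Per-bpf-ma part: a single list waiting for the (much slower)
 * RCU tasks trace GP, so one slow GP does not pin memory on every cpu.
 */
struct bpf_mem_alloc_sketch {
	struct bpf_mem_cache_sketch __percpu *cache;
	raw_spinlock_t lock;
	struct llist_head wait_for_free; /* waiting for RCU tasks trace GP */
	struct rcu_head rcu;
};

/* Illustration of "unit_free() can push into free_by_rcu directly":
 * no separate prepare_reuse stage.
 */
static void unit_free_sketch(struct bpf_mem_cache_sketch *c,
			     struct llist_node *llnode)
{
	llist_add(llnode, &c->free_by_rcu);
}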