On Thu, Jun 08, 2023 at 09:51:27AM +0800, Hou Tao wrote:
> Hi,
>
> On 6/8/2023 4:50 AM, Alexei Starovoitov wrote:
> > On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> >> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> >>> As said in the commit message, the command line for the test is
> >>> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> >>> using the default max_entries and the number of CPUs is greater than 15,
> >>> use_percpu_counter will be false.
> >> Right. percpu or not depends on the number of cpus.
> >>
> >>> I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
> >>> test. For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
> >>> there is an obvious performance degradation.
> >> ...
> >>> [root@hello bpf]# ./map_perf_test 4 8 16384
> >>> 2:hash_map_perf kmalloc 359201 events per sec
> >> ..
> >>> [root@hello bpf]# ./map_perf_test 4 8 16384
> >>> 4:hash_map_perf kmalloc 203983 events per sec
> >> this is indeed a degradation in a VM.
> >>
> >>> I also ran map_perf_test on a physical x86-64 host with 72 CPUs. The
> >>> performance for "./map_perf_test 4 8" is similar, but there is an obvious
> >>> performance degradation for "./map_perf_test 4 8 16384"
> >> but... a degradation?
> Er, my bad. I mislabeled "Before" and "After". v4 indeed introduces a
> big performance degradation on the physical host.
> >>
> >>> Before reuse-after-rcu-gp:
> >>>
> >>> [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> >>> 1:hash_map_perf kmalloc 388088 events per sec
> >> ...
> >>> After reuse-after-rcu-gp:
> >>> [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> >>> 5:hash_map_perf kmalloc 655628 events per sec
> >> This is a big improvement :) Not a degradation.
> >> You always have to double check the numbers with perf report.
> >>
> >>> So could you please double check your setup and rerun map_perf_test? If
> >>> there is no performance degradation, could you please share your setup
> >>> and your kernel config file?
> >> I'm testing on a normal no-debug kernel. No kasan. No lockdep. HZ=1000
> >> Playing with it a bit more I found something interesting:
> >> map_perf_test 4 8 16384
> >> before/after has too much noise to be conclusive.
> >>
> >> So I did
> >> map_perf_test 4 8 16384 1000000
> >>
> >> and now I see a significant degradation from patch 3.
> >> It drops from 800k to 200k.
> >> And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
> >> The following hack addresses most of the perf degradation:
> >>
> >> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> >> index fea1cb0c78bb..eeadc9359097 100644
> >> --- a/kernel/bpf/memalloc.c
> >> +++ b/kernel/bpf/memalloc.c
> >> @@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> >>         alloc = 0;
> >>         head = NULL;
> >>         tail = NULL;
> >> -       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
> >> +       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
> >>         while (alloc < cnt) {
> >>                 obj = __llist_del_first(&sc->reuse_ready_head);
> >>                 if (obj) {
> >> @@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> >>                 alloc++;
> >>         }
> >>         raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> >> +       }
> >>
> >>         if (alloc) {
> >>                 if (IS_ENABLED(CONFIG_PREEMPT_RT))
> >> @@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
> >>                 sc->reuse_ready_tail = NULL;
> >>                 WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
> >>                 __llist_add_batch(head, tail, &sc->wait_for_free);
> >> +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> >>                 call_rcu_tasks_trace(&sc->rcu, free_rcu);
> >> +       } else {
> >> +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> >>         }
> >> -       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> >>  }
> >>
> >> It now drops from 800k to 450k.
> >> And perf report shows that both reuse is happening and slab is working hard
> >> to satisfy kmalloc/kfree.
> >> So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma
> >> waiting_for_rcu_task_trace_gp lists.
> > Sorry. per-cpu waiting_for_rcu_gp is what patch 3 does already.
> > I meant per-cpu reuse_ready and per-bpf-ma waiting_for_rcu_task_trace_gp.
> Yes, I know, because I had just proposed it in the email yesterday.
> >
> > Also noticed that the overhead of the shared reuse_ready list
> > comes both from the contended lock and from cache misses
> > when one cpu pushes to the list after RCU GP and another
> > cpu removes.
> >
> > Also the low/batch/high watermarks are all wrong in patch 3.
> > low=32 and high=96 make no sense when it's not a single list.
> > I'm experimenting with 32 for all three heuristics.
> >
> > Another thing I noticed is that per-cpu prepare_reuse and free_by_rcu
> > are redundant.
> > unit_free() can push into free_by_rcu directly,
> > then reuse_bulk() can fill it up with free_llist_extra and
> > move them into waiting_for_gp.
> Yes. Indeed missing that.
> >
> > All these _tail optimizations are obscuring the code and making it hard
> > to notice these issues.
> >
> >> For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
> >>
> >> Answering your other email:
> >>
> >>> I see your point. I will continue to debug the memory usage difference
> >>> between v3 and v4.
> >> imo it's a waste of time to continue analyzing performance based on the bench in patch 2.
> Don't agree with that. I still think the potential memory usage of v4 is
> a problem, and the difference in memory usage between v3 and v4 reveals that
> there is some peculiarity in the RCU subsystem, because the difference comes
> from the duration of the RCU grace period. We need to find out why and fix
> or work around that.

A tight loop in the kernel can extend RCU grace periods, especially for
kernels built with CONFIG_PREEMPTION=n. Placing things like cond_resched()
in such loops can help. Of course, if you are in an RCU read-side critical
section (or holding a spinlock), you will need to exit, cond_resched(),
then re-enter, taking care to ensure that the state upon re-entry is valid.
After all, having exited either type of critical section, anything might
happen.
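For illustration, that pattern might look like the sketch below. It is not
code from the patches under discussion; the table type and the helpers are
made up, and only rcu_read_lock()/rcu_read_unlock() and cond_resched() are
real kernel APIs. The point is that the cursor (here, the bucket index)
stays valid across the exit and re-entry:

    #include <linux/rcupdate.h>
    #include <linux/sched.h>

    struct my_table {                       /* made-up type for illustration */
            int n_buckets;
            /* ... per-bucket data ... */
    };

    void scan_one_bucket(struct my_table *tab, int idx);   /* made-up helper */

    static void scan_table(struct my_table *tab)
    {
            int i;

            for (i = 0; i < tab->n_buckets; i++) {
                    rcu_read_lock();
                    scan_one_bucket(tab, i);
                    rcu_read_unlock();
                    /* Outside the read-side critical section: let the
                     * scheduler run so grace periods can complete.
                     */
                    cond_resched();
            }
    }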
							Thanx, Paul

> >>> I don't think so. Let's consider the per-cpu list first. Assume the
> >>> normal RCU grace period is about 30ms and we are tracing the IO latency
> >>> of a normal SSD. The IOPS is about 176K per second, so before one RCU
> >>> GP has passed, we will need to allocate about 176 * 30 = 5.2K elements.
> >>> For the per-ma list, when the number of CPUs increases, it is easy for
> >>> the list to contain thousands of elements.
> >> That would be true only if there were no scheduling events in all of 176K ops.
> >> Which is not the case.
> >> I'm not sure why you're saying that RCU GP is 30ms.
> Because these freed elements will be freed after one RCU GP, and in v4
> there is only one RCU callback per cpu, so before one RCU GP expires,
> these freed elements will accumulate on the list.
> >> In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
> >> Every sched event is sort-of an implicit rcu_read_lock/unlock.
> >> Network and block IO doesn't process 176K packets without resched.
> >> Don't know how block does it, but in networking NAPI will process 64 packets
> >> and will yield softirq.
> >>
> >> For small size buckets low_watermark=32 and high=96.
> >> We typically move 32 elements at a time from one list to another.
> >> A bunch of elements may be sitting in free_by_rcu, and moving them to
> >> waiting_for_gp is not instant, but once __free_rcu_tasks_trace is called
> >> we need to take elements from waiting_for_gp one at a time and kfree them
> >> one at a time.
> >> So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
> >>
> >>> Before I post v5, I want to know the reason why the per-bpf-ma list is
> >>> introduced. Previously, I thought it was used to handle the case in which
> >>> allocation and freeing are done on different CPUs.
> >> Correct. The per-bpf-ma list is necessary to avoid OOM-ing due to a slow rcu_tasks_trace GP.
> >>
> >>> And as we can see
> >>> from the benchmark results above and in v3, the performance and the
> >>> memory usage of v4 for add_del_on_diff_cpu are better than in v3.
> >> bench from patch 2 is invalid. Hence no conclusion can be made.
> >>
> >> So far the only bench we can trust and analyze is map_perf_test.
> >> Please make the bench in patch 2 yield the cpu after a few updates.
> >> Earlier I suggested sticking to 10, but since NAPI can do 64 at a time,
> >> 64 updates is realistic too. A thousand is not.
> Will do that.
>
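For reference, one way a bench producer could yield after a NAPI-sized
batch is sketched below. This is only an illustration under the assumption
that the bench drives updates from a userspace loop; the actual bench code
in patch 2 may be structured differently, and do_one_update(), map_fd and
the done flag are all made-up names:

    #include <sched.h>
    #include <stdbool.h>

    #define BATCH_SIZE 64                   /* mirrors NAPI's per-poll budget */

    static volatile bool done;              /* made-up stop flag */

    /* made-up helper that triggers a single map update */
    void do_one_update(int map_fd, int i);

    static void producer_loop(int map_fd)
    {
            while (!done) {
                    int i;

                    for (i = 0; i < BATCH_SIZE; i++)
                            do_one_update(map_fd, i);
                    /* Give the cpu back after a small batch instead of
                     * hammering updates for thousands of iterations.
                     */
                    sched_yield();
            }
    }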