Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator

On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> > As said in the commit message, the command line for the test is
> > "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> > using the default max_entries and the number of CPUs is greater than 15,
> > use_percpu_counter will be false.
>
> Right. percpu or not depends on number of cpus.
>
> >
> > I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
> > test.  For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
> > there is obvious performance degradation.
> ...
> > [root@hello bpf]# ./map_perf_test 4 8 16384
> > 2:hash_map_perf kmalloc 359201 events per sec
> ..
> > [root@hello bpf]# ./map_perf_test 4 8 16384
> > 4:hash_map_perf kmalloc 203983 events per sec
>
> this is indeed a degradation in a VM.
>
> > I also ran map_perf_test on a physical x86-64 host with 72 CPUs. The
> > performance numbers for "./map_perf_test 4 8" are similar, but there is
> > an obvious performance degradation for "./map_perf_test 4 8 16384".
>
> but... a degradation?
>
> > Before reuse-after-rcu-gp:
> >
> > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > 1:hash_map_perf kmalloc 388088 events per sec
> ...
> > After reuse-after-rcu-gp:
> > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > 5:hash_map_perf kmalloc 655628 events per sec
>
> This is a big improvement :) Not a degradation.
> You always have to double check the numbers with perf report.
>
> > So could you please double check your setup and rerun map_perf_test? If
> > there is no performance degradation, could you please share your setup
> > and your kernel config file?
>
> I'm testing on normal no-debug kernel. No kasan. No lockdep. HZ=1000
> Playing with it a bit more I found something interesting:
> map_perf_test 4 8 16384
> before/after has too much noise to be conclusive.
>
> So I did
> map_perf_test 4 8 16384 1000000
>
> and now I see a significant degradation from patch 3.
> It drops from 800k to 200k.
> And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
> The following hack addresses most of the perf degradation:
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index fea1cb0c78bb..eeadc9359097 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
>         alloc = 0;
>         head = NULL;
>         tail = NULL;
> -       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
> +       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
>         while (alloc < cnt) {
>                 obj = __llist_del_first(&sc->reuse_ready_head);
>                 if (obj) {
> @@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
>                 alloc++;
>         }
>         raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> +       }
>
>         if (alloc) {
>                 if (IS_ENABLED(CONFIG_PREEMPT_RT))
> @@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
>                 sc->reuse_ready_tail = NULL;
>                 WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
>                 __llist_add_batch(head, tail, &sc->wait_for_free);
> +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>                 call_rcu_tasks_trace(&sc->rcu, free_rcu);
> +       } else {
> +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>         }
> -       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
>  }
>
> It now drops from 800k to 450k.
> And perf report shows that both reuse is happening and slab is working hard to satisfy kmalloc/kfree.
> So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma waiting_for_rcu_task_trace_gp lists.

Sorry. per-cpu waiting_for_rcu_gp is what patch 3 does already.
I meant per-cpu reuse_ready and per-bpf-ma waiting_for_rcu_task_trace_gp.
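
Roughly, the split I have in mind is something like this. It is only a
sketch to show where the lists would live; the field names are
illustrative, not from any posted patch:

struct bpf_mem_cache {
	/* ... existing per-cpu free lists ... */

	/* per-cpu: elements whose RCU GP has already elapsed and that
	 * can be reused locally without touching a shared lock.
	 */
	struct llist_head reuse_ready;
};

struct bpf_mem_alloc {
	struct bpf_mem_cache __percpu *cache;

	/* per-bpf-ma: elements still waiting for an RCU tasks trace GP
	 * before they can go back to slab. Shared across cpus, but
	 * touched much less often than the reuse path.
	 */
	struct llist_head waiting_for_rcu_task_trace_gp;
	raw_spinlock_t free_lock;
	struct rcu_head rcu;
};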

I also noticed that the overhead of the shared reuse_ready list
comes both from the contended lock and from cache misses
when one cpu pushes to the list after the RCU GP and another
cpu removes from it.

Also, the low/batch/high watermarks are all wrong in patch 3.
low=32 and high=96 make no sense when it's not a single list.
I'm experimenting with 32 for all three heuristics.
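
To illustrate (the helper names below are made up, only the single
constant matters):

/* one batch size for low watermark, high watermark and batch move */
#define BPF_MA_REUSE_BATCH	32

static bool need_refill(struct bpf_mem_cache *c)
{
	/* refill from the per-cpu reuse_ready list when running low */
	return c->free_cnt < BPF_MA_REUSE_BATCH;
}

static bool need_flush(struct bpf_mem_cache *c)
{
	/* start flushing once a full batch has accumulated */
	return c->free_cnt > BPF_MA_REUSE_BATCH;
}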

Another thing I noticed is that the per-cpu prepare_reuse and free_by_rcu
lists are redundant.
unit_free() can push into free_by_rcu directly,
then reuse_bulk() can top it up from free_llist_extra and
move the elements into waiting_for_gp.
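
Rough pseudo-code of what I mean (illustrative only, not a real patch;
the c->active/irq protection in reuse_bulk() and the GP handling are
elided):

static void unit_free(struct bpf_mem_cache *c, void *ptr)
{
	struct llist_node *llnode = ptr - LLIST_NODE_SZ;
	unsigned long flags;

	local_irq_save(flags);
	if (local_inc_return(&c->active) == 1) {
		/* no separate prepare_reuse stage: freed objects go
		 * straight into the per-cpu free_by_rcu list
		 */
		__llist_add(llnode, &c->free_by_rcu);
	} else {
		llist_add(llnode, &c->free_llist_extra);
	}
	local_dec(&c->active);
	local_irq_restore(flags);
}

static void reuse_bulk(struct bpf_mem_cache *c)
{
	struct llist_node *llnode, *t;

	/* top up free_by_rcu with whatever landed in free_llist_extra
	 * from irq/nmi context ...
	 */
	llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
		__llist_add(llnode, &c->free_by_rcu);

	/* ... then move everything to waiting_for_gp; the RCU callback
	 * makes it reusable after the GP, same as today
	 */
	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
		llist_add(llnode, &c->waiting_for_gp);
}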

All these _tail optimizations obscure the code and make it hard
to notice these issues.

> For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
>
> Answering your other email:
>
> > I see your point. I will continue to debug the memory usage difference
> > between v3 and v4.
>
> imo it's a waste of time to continue analyzing performance based on bench in patch 2.
>
> > I don't think so. Let's consider the per-cpu list first. Assume the
> > normal RCU grace period is about 30ms and we are tracing the IO latency
> > of a normal SSD. The IOPS is about 176K per second, so before one RCU
> > GP has passed, we will need to allocate about 176 * 30 = 5.2K elements.
> > For the per-ma list, when the number of CPUs increases, it is easy for
> > the list to grow to thousands of elements.
>
> That would be true only if there were no scheduling events in all of 176K ops.
> Which is not the case.
> I'm not sure why you're saying that RCU GP is 30ms.
> In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
> Every sched event is sort-of implicit rcu_read_lock/unlock.
> Network and block IO don't process 176K packets without a resched.
> Don't know how block does it, but in networking NAPI will process 64 packets and then yield softirq.
>
> For small size buckets low_watermark=32 and high=96.
> We typically move 32 elements at a time from one list to another.
> A bunch of elements may be sitting in free_by_rcu, and moving them to waiting_for_gp
> is not instant, but once __free_rcu_tasks_trace is called we need to take
> elements from waiting_for_gp one at a time and kfree them one at a time.
> So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
>
> > Before I post v5, I want to know the reason why the per-bpf-ma list is
> > introduced. Previously, I thought it was used to handle the case in which
> > allocation and freeing are done on different CPUs.
>
> Correct. per-bpf-ma list is necessary to avoid OOM-ing due to slow rcu_tasks_trace GP.
>
> > And as we can see
> > from the benchmark results above and in v3, the performance and the
> > memory usage of v4 for add_del_on_diff_cpu are better than in v3.
>
> bench from patch 2 is invalid. Hence no conclusion can be made.
>
> So far the only bench we can trust and analyze is map_perf_test.
> Please make the bench in patch 2 yield the cpu after a few updates.
> Earlier I suggested sticking to 10, but since NAPI can do 64 at a time,
> 64 updates is realistic too. A thousand is not.
