Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Wed, 7 Jun 2023 16:50:35 -0700

On Wed, Jun 7, 2023 at 4:30 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>
> On Wed, Jun 07, 2023 at 04:23:20PM -0700, Alexei Starovoitov wrote:
> > On Wed, Jun 7, 2023 at 1:50 PM Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> > >
> > > On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
> > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > >
> > > > On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> > > > > As said in the commit message, the command line for test is
> > > > > "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> > > > > using default max_entries and the number of CPUs is greater than 15,
> > > > > use_percpu_counter will be false.
> > > >
> > > > Right. percpu or not depends on number of cpus.
> > > >
> > > > >
> > > > > I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
> > > > > test.  For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
> > > > > there are obvious performance degradation.
> > > > ...
> > > > > [root@hello bpf]# ./map_perf_test 4 8 16384
> > > > > 2:hash_map_perf kmalloc 359201 events per sec
> > > > ..
> > > > > [root@hello bpf]# ./map_perf_test 4 8 16384
> > > > > 4:hash_map_perf kmalloc 203983 events per sec
> > > >
> > > > this is indeed a degration in a VM.
> > > >
> > > > > I also run map_perf_test on a physical x86-64 host with 72 CPUs. The
> > > > > performances for "./map_perf_test 4 8" are similar, but there is obvious
> > > > > performance degradation for "./map_perf_test 4 8 16384"
> > > >
> > > > but... a degradation?
> > > >
> > > > > Before reuse-after-rcu-gp:
> > > > >
> > > > > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > > > > 1:hash_map_perf kmalloc 388088 events per sec
> > > > ...
> > > > > After reuse-after-rcu-gp:
> > > > > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > > > > 5:hash_map_perf kmalloc 655628 events per sec
> > > >
> > > > This is a big improvement :) Not a degration.
> > > > You always have to double check the numbers with perf report.
> > > >
> > > > > So could you please double check your setup and rerun map_perf_test ? If
> > > > > there is no performance degradation, could you please share your setup
> > > > > and your kernel configure file ?
> > > >
> > > > I'm testing on normal no-debug kernel. No kasan. No lockdep. HZ=1000
> > > > Playing with it a bit more I found something interesting:
> > > > map_perf_test 4 8 16348
> > > > before/after has too much noise to be conclusive.
> > > >
> > > > So I did
> > > > map_perf_test 4 8 16348 1000000
> > > >
> > > > and now I see significant degration from patch 3.
> > > > It drops from 800k to 200k.
> > > > And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
> > > > The following hack addresses most of the perf degradtion:
> > > >
> > > > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> > > > index fea1cb0c78bb..eeadc9359097 100644
> > > > --- a/kernel/bpf/memalloc.c
> > > > +++ b/kernel/bpf/memalloc.c
> > > > @@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> > > >         alloc = 0;
> > > >         head = NULL;
> > > >         tail = NULL;
> > > > -       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
> > > > +       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
> > > >         while (alloc < cnt) {
> > > >                 obj = __llist_del_first(&sc->reuse_ready_head);
> > > >                 if (obj) {
> > > > @@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> > > >                 alloc++;
> > > >         }
> > > >         raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > > +       }
> > > >
> > > >         if (alloc) {
> > > >                 if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > > > @@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
> > > >                 sc->reuse_ready_tail = NULL;
> > > >                 WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
> > > >                 __llist_add_batch(head, tail, &sc->wait_for_free);
> > > > +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > >                 call_rcu_tasks_trace(&sc->rcu, free_rcu);
> > > > +       } else {
> > > > +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > >         }
> > > > -       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > >  }
> > > >
> > > > It now drops from 800k to 450k.
> > > > And perf report shows that both reuse is happening and slab is working hard to satisfy kmalloc/kfree.
> > > > So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma waiting_for_rcu_task_trace_gp lists.
> > >
> > > Sorry. per-cpu waiting_for_rcu_gp is what patch 3 does already.
> > > I meant per-cpu reuse_ready and per-bpf-ma waiting_for_rcu_task_trace_gp.
> >
> > An update..
> >
> > I tweaked patch 3 to do per-cpu reuse_ready and it addressed
> > the lock contention, but cache miss on
> > __llist_del_first(&c->reuse_ready_head);
> > was still very high and performance was still at 450k as
> > with a simple hack above.
> >
> > Then I removed some of the _tail optimizations and added counters
> > to these llists.
> > To my surprise
> > map_perf_test 4 1 16348 1000000
> > was showing ~200k on average in waiting_for_gp when reuse_rcu() is called
> > and ~400k sitting in reuse_ready_head.
> >
> > Then noticed that we should be doing:
> > call_rcu_hurry(&c->rcu, reuse_rcu);
> > instead of call_rcu(),
> > but my config didn't have RCU_LAZY, so that didn't help.
> > Obviously we cannot allow such a huge number of elements to sit
> > in these link lists.
> > The whole "reuse-after-rcu-gp" idea for bpf_mem_alloc may not work.
> > To unblock qp-trie work I suggest to add rcu_head to each inner node
> > and do call_rcu() on them before free-ing them to bpf_mem_alloc.
> > Explicit call_rcu would disqualify qp-tree from tracing programs though :(
>
> I am sure that you guys have already considered and discarded this one,
> but I cannot help but suggest SLAB_TYPESAFE_BY_RCU.

SLAB_TYPESAFE_BY_RCU is what bpf_mem_alloc is doing right now.
We want to add an option to make it not do it and instead observe RCU GP
for every element freed via bpf_mem_free().
In other words, make bpf_mem_free() behave like kfree_rcu.
I just tried to use rcu_expedite_gp() before bpf prog runs
and it helps a bit.