Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Wed, 7 Jun 2023 18:15:58 -0700

On Wed, Jun 07, 2023 at 05:34:50PM -0700, Alexei Starovoitov wrote:
> On Wed, Jun 7, 2023 at 5:13 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >
> > On Wed, Jun 07, 2023 at 04:50:35PM -0700, Alexei Starovoitov wrote:
> > > On Wed, Jun 7, 2023 at 4:30 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > > >
> > > > On Wed, Jun 07, 2023 at 04:23:20PM -0700, Alexei Starovoitov wrote:
> > > > > On Wed, Jun 7, 2023 at 1:50 PM Alexei Starovoitov
> > > > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
> > > > > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> > > > > > > > As said in the commit message, the command line for test is
> > > > > > > > "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> > > > > > > > using default max_entries and the number of CPUs is greater than 15,
> > > > > > > > use_percpu_counter will be false.
> > > > > > >
> > > > > > > Right. percpu or not depends on number of cpus.
> > > > > > >
> > > > > > > >
> > > > > > > > I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
> > > > > > > > test.  For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
> > > > > > > > there are obvious performance degradation.
> > > > > > > ...
> > > > > > > > [root@hello bpf]# ./map_perf_test 4 8 16384
> > > > > > > > 2:hash_map_perf kmalloc 359201 events per sec
> > > > > > > ..
> > > > > > > > [root@hello bpf]# ./map_perf_test 4 8 16384
> > > > > > > > 4:hash_map_perf kmalloc 203983 events per sec
> > > > > > >
> > > > > > > this is indeed a degration in a VM.
> > > > > > >
> > > > > > > > I also run map_perf_test on a physical x86-64 host with 72 CPUs. The
> > > > > > > > performances for "./map_perf_test 4 8" are similar, but there is obvious
> > > > > > > > performance degradation for "./map_perf_test 4 8 16384"
> > > > > > >
> > > > > > > but... a degradation?
> > > > > > >
> > > > > > > > Before reuse-after-rcu-gp:
> > > > > > > >
> > > > > > > > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > > > > > > > 1:hash_map_perf kmalloc 388088 events per sec
> > > > > > > ...
> > > > > > > > After reuse-after-rcu-gp:
> > > > > > > > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > > > > > > > 5:hash_map_perf kmalloc 655628 events per sec
> > > > > > >
> > > > > > > This is a big improvement :) Not a degration.
> > > > > > > You always have to double check the numbers with perf report.
> > > > > > >
> > > > > > > > So could you please double check your setup and rerun map_perf_test ? If
> > > > > > > > there is no performance degradation, could you please share your setup
> > > > > > > > and your kernel configure file ?
> > > > > > >
> > > > > > > I'm testing on normal no-debug kernel. No kasan. No lockdep. HZ=1000
> > > > > > > Playing with it a bit more I found something interesting:
> > > > > > > map_perf_test 4 8 16348
> > > > > > > before/after has too much noise to be conclusive.
> > > > > > >
> > > > > > > So I did
> > > > > > > map_perf_test 4 8 16348 1000000
> > > > > > >
> > > > > > > and now I see significant degration from patch 3.
> > > > > > > It drops from 800k to 200k.
> > > > > > > And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
> > > > > > > The following hack addresses most of the perf degradtion:
> > > > > > >
> > > > > > > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> > > > > > > index fea1cb0c78bb..eeadc9359097 100644
> > > > > > > --- a/kernel/bpf/memalloc.c
> > > > > > > +++ b/kernel/bpf/memalloc.c
> > > > > > > @@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> > > > > > >         alloc = 0;
> > > > > > >         head = NULL;
> > > > > > >         tail = NULL;
> > > > > > > -       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
> > > > > > > +       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
> > > > > > >         while (alloc < cnt) {
> > > > > > >                 obj = __llist_del_first(&sc->reuse_ready_head);
> > > > > > >                 if (obj) {
> > > > > > > @@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> > > > > > >                 alloc++;
> > > > > > >         }
> > > > > > >         raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > > > > > +       }
> > > > > > >
> > > > > > >         if (alloc) {
> > > > > > >                 if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > > > > > > @@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
> > > > > > >                 sc->reuse_ready_tail = NULL;
> > > > > > >                 WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
> > > > > > >                 __llist_add_batch(head, tail, &sc->wait_for_free);
> > > > > > > +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > > > > >                 call_rcu_tasks_trace(&sc->rcu, free_rcu);
> > > > > > > +       } else {
> > > > > > > +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > > > > >         }
> > > > > > > -       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > > > > > >  }
> > > > > > >
> > > > > > > It now drops from 800k to 450k.
> > > > > > > And perf report shows that both reuse is happening and slab is working hard to satisfy kmalloc/kfree.
> > > > > > > So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma waiting_for_rcu_task_trace_gp lists.
> > > > > >
> > > > > > Sorry. per-cpu waiting_for_rcu_gp is what patch 3 does already.
> > > > > > I meant per-cpu reuse_ready and per-bpf-ma waiting_for_rcu_task_trace_gp.
> > > > >
> > > > > An update..
> > > > >
> > > > > I tweaked patch 3 to do per-cpu reuse_ready and it addressed
> > > > > the lock contention, but cache miss on
> > > > > __llist_del_first(&c->reuse_ready_head);
> > > > > was still very high and performance was still at 450k as
> > > > > with a simple hack above.
> > > > >
> > > > > Then I removed some of the _tail optimizations and added counters
> > > > > to these llists.
> > > > > To my surprise
> > > > > map_perf_test 4 1 16348 1000000
> > > > > was showing ~200k on average in waiting_for_gp when reuse_rcu() is called
> > > > > and ~400k sitting in reuse_ready_head.
> > > > >
> > > > > Then noticed that we should be doing:
> > > > > call_rcu_hurry(&c->rcu, reuse_rcu);
> > > > > instead of call_rcu(),
> > > > > but my config didn't have RCU_LAZY, so that didn't help.
> > > > > Obviously we cannot allow such a huge number of elements to sit
> > > > > in these link lists.
> > > > > The whole "reuse-after-rcu-gp" idea for bpf_mem_alloc may not work.
> > > > > To unblock qp-trie work I suggest to add rcu_head to each inner node
> > > > > and do call_rcu() on them before free-ing them to bpf_mem_alloc.
> > > > > Explicit call_rcu would disqualify qp-tree from tracing programs though :(
> > > >
> > > > I am sure that you guys have already considered and discarded this one,
> > > > but I cannot help but suggest SLAB_TYPESAFE_BY_RCU.
> > >
> > > SLAB_TYPESAFE_BY_RCU is what bpf_mem_alloc is doing right now.
> > > We want to add an option to make it not do it and instead observe RCU GP
> > > for every element freed via bpf_mem_free().
> > > In other words, make bpf_mem_free() behave like kfree_rcu.
> > > I just tried to use rcu_expedite_gp() before bpf prog runs
> > > and it helps a bit.
> >
> > OK, got it, so you guys have considered, implemented, and are now trying
> > to discard SLAB_TYPESAFE_BY_RCU.  ;-)
> >
> > Given that you are using call_rcu() / call_rcu_hurry(), I am a bit
> > surprised that rcu_expedite_gp() makes any difference.
> >
> > We do some expediting if there are huge numbers of callbacks or if one
> > of RCU's shrinker notifiers is invoked.  If the concern is only memory
> > footprint, it is possible to make the shrinkers more aggressive.  I am
> > not sure whether making them unconditionally more aggressive is a good
> > idea, however if memory footprint is the only concern and if shrink-time
> > expediting would suffice, it is certainly worth some investigation.
> 
> Right. I don't think it's a good idea to tweak RCU for this use case.
> RCU parameters have to be optimized for all. Instead the bpf side needs
> to understand how RCU heuristics/watermarks work and play that game.
> For example, Hou's patch 3 has one pending call_rcu per-cpu.
> As soon as one call_rcu_hurry is done all future freed elements gets
> queued into llist and for the next call_rcu_hurry() that list will
> contain 100k elements.
> I believe from RCU pov one pending call_rcu cb is not a reason to
> act right away. It's trying to batch multiple cb-s.

Indeed, if a single RCU callback was that important, it would be necessary
for the caller to explicitly say so somehow.  And I agree that this
would open a can of worms that might be best left unopened.

> Right now I'm experimenting with multiple call_rcu calls from the bpf side,
> so that RCU sees multiple pending cb-s and has to act.
> It seems to work much better. Memory footprint is now reasonable.

Nice!!!

> Could you point me to a code in RCU where it's doing callback batching?

There are quite a few things in RCU that make this happen:

1.	The latency of the previous RCU grace period automatically
	batches up callbacks for the RCU grace period that is just
	now starting.

2.	More specifically, the rcutree.jiffies_till_first_fqs and
	rcutree.jiffies_till_next_fqs module parameters control
	how long RCU grace-period processing waits between checks
	for idle CPUs.

3.	The rcu_implicit_dynticks_qs() function is invoked periodically
	as specified by #2 above, and pays attention to grace-period
	progress.  If it decides that the current grace period is not
	moving along quickly enough, it will enlist help from
	cond_resched(), the context-switch code, and from a
	scheduler IPI.  Failing that, it collects information that
	can be printed out by the ensuing RCU CPU stall warning.

4.	When callbacks are not offloaded to the rcuo kthreads, when
	the __call_rcu_core() function decides that there are too many
	callbacks piled up on the current CPU, it gives the current grace
	period a kick by invoking rcu_force_quiescent_state(), which
	in turn forces RCU to check for holdout CPUs.  And also which
	forces the current CPU to check immediately for a quiescent state.

5.	The __call_rcu_nocb_wake() does something similar for the case
	where callbacks are offloaded.

	Most datacenter Linux kernels do *not* offload callbacks.

As a rough rule of thumb, RCU wants to keep grace period latency
somewhere in the milliseconds.  If the grace period goes too slowly or
if there are too many callbacks piling up, it will take evasive action
as described above.

Does that help, or am I missing the point of your question?

							Thanx, Paul