On Wed, Jun 28, 2023 at 10:57 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>
> On Tue, Jun 27, 2023 at 06:56:33PM -0700, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@xxxxxxxxxx>
> >
> > Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu().
> > Unlike bpf_mem_[cache_]free(), which links objects into the per-cpu
> > free list for immediate reuse, the _rcu() flavor waits for an RCU grace
> > period and then moves objects into the free_by_rcu_ttrace list, where
> > they wait for an RCU tasks trace grace period before being freed into slab.
> >
> > The life cycle of objects:
> > alloc: dequeue free_llist
> > free: enqueue free_llist
> > free_rcu: enqueue free_by_rcu -> waiting_for_gp
> > free_llist above high watermark -> free_by_rcu_ttrace
> > after RCU GP waiting_for_gp -> free_by_rcu_ttrace
> > free_by_rcu_ttrace -> waiting_for_gp_ttrace -> slab
> >
> > Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> > ---
> >  include/linux/bpf_mem_alloc.h |   2 +
> >  kernel/bpf/memalloc.c         | 129 +++++++++++++++++++++++++++++++++-
> >  2 files changed, 128 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
> > index 3929be5743f4..d644bbb298af 100644
> > --- a/include/linux/bpf_mem_alloc.h
> > +++ b/include/linux/bpf_mem_alloc.h
> > @@ -27,10 +27,12 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma);
> >  /* kmalloc/kfree equivalent: */
> >  void *bpf_mem_alloc(struct bpf_mem_alloc *ma, size_t size);
> >  void bpf_mem_free(struct bpf_mem_alloc *ma, void *ptr);
> > +void bpf_mem_free_rcu(struct bpf_mem_alloc *ma, void *ptr);
> >
> >  /* kmem_cache_alloc/free equivalent: */
> >  void *bpf_mem_cache_alloc(struct bpf_mem_alloc *ma);
> >  void bpf_mem_cache_free(struct bpf_mem_alloc *ma, void *ptr);
> > +void bpf_mem_cache_free_rcu(struct bpf_mem_alloc *ma, void *ptr);
> >  void bpf_mem_cache_raw_free(void *ptr);
> >  void *bpf_mem_cache_alloc_flags(struct bpf_mem_alloc *ma, gfp_t flags);
> >
> > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> > index 40524d9454c7..3081d06a434c 100644
> > --- a/kernel/bpf/memalloc.c
> > +++ b/kernel/bpf/memalloc.c
> > @@ -101,6 +101,15 @@ struct bpf_mem_cache {
> >          bool draining;
> >          struct bpf_mem_cache *tgt;
> >
> > +        /* list of objects to be freed after RCU GP */
> > +        struct llist_head free_by_rcu;
> > +        struct llist_node *free_by_rcu_tail;
> > +        struct llist_head waiting_for_gp;
> > +        struct llist_node *waiting_for_gp_tail;
> > +        struct rcu_head rcu;
> > +        atomic_t call_rcu_in_progress;
> > +        struct llist_head free_llist_extra_rcu;
> > +
> >          /* list of objects to be freed after RCU tasks trace GP */
> >          struct llist_head free_by_rcu_ttrace;
> >          struct llist_head waiting_for_gp_ttrace;
> > @@ -344,6 +353,69 @@ static void free_bulk(struct bpf_mem_cache *c)
> >          do_call_rcu_ttrace(tgt);
> >  }
> >
> > +static void __free_by_rcu(struct rcu_head *head)
> > +{
> > +        struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
> > +        struct bpf_mem_cache *tgt = c->tgt;
> > +        struct llist_node *llnode;
> > +
> > +        llnode = llist_del_all(&c->waiting_for_gp);
> > +        if (!llnode)
> > +                goto out;
> > +
> > +        llist_add_batch(llnode, c->waiting_for_gp_tail, &tgt->free_by_rcu_ttrace);
> > +
> > +        /* Objects went through regular RCU GP. Send them to RCU tasks trace */
> > +        do_call_rcu_ttrace(tgt);
> > +out:
> > +        atomic_set(&c->call_rcu_in_progress, 0);
> > +}
> > +
> > +static void check_free_by_rcu(struct bpf_mem_cache *c)
> > +{
> > +        struct llist_node *llnode, *t;
> > +        unsigned long flags;
> > +
> > +        /* drain free_llist_extra_rcu */
> > +        if (unlikely(!llist_empty(&c->free_llist_extra_rcu))) {
> > +                inc_active(c, &flags);
> > +                llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra_rcu))
> > +                        if (__llist_add(llnode, &c->free_by_rcu))
> > +                                c->free_by_rcu_tail = llnode;
> > +                dec_active(c, flags);
> > +        }
> > +
> > +        if (llist_empty(&c->free_by_rcu))
> > +                return;
> > +
> > +        if (atomic_xchg(&c->call_rcu_in_progress, 1)) {
> > +                /*
> > +                 * Instead of kmalloc-ing new rcu_head and triggering 10k
> > +                 * call_rcu() to hit rcutree.qhimark and force RCU to notice
> > +                 * the overload just ask RCU to hurry up. There could be many
> > +                 * objects in free_by_rcu list.
> > +                 * This hint reduces memory consumption for an artificial
> > +                 * benchmark from 2 Gbyte to 150 Mbyte.
> > +                 */
> > +                rcu_request_urgent_qs_task(current);
>
> I have been going back and forth on whether rcu_request_urgent_qs_task()
> needs to throttle calls to itself, for example, to pay attention to only
> one invocation per jiffy.  The theory here is that RCU's state machine
> normally only advances about once per jiffy anyway.
>
> The main risk of *not* throttling is if several CPUs were to invoke
> rcu_request_urgent_qs_task() in tight loops while those same CPUs were
> undergoing interrupt storms, which would result in heavy lock contention
> in __rcu_irq_enter_check_tick().  This is not exactly a common-case
> scenario, but on the other hand, if you are having this degree of trouble,
> should RCU really be adding lock contention to your troubles?

I see the spin_lock in __rcu_irq_enter_check_tick(), but I didn't observe
contention on it in practice, even when I was calling
rcu_request_urgent_qs_task() in multiple places throughout bpf_mem_alloc.
I left it in only one place (this patch), because that was enough to
'hurry up the RCU' and make the difference.

rdp = this_cpu_ptr(&rcu_data);
is per-CPU, so I'm not sure why you think contention is possible.

I think we should avoid extra logic either in RCU or in bpf_mem_alloc to
keep the code simple, since the contention is hypothetical at this point.
I've tried preempt and non-preempt configs, with and without debug.
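
Just to make the alternative concrete: if we did want to throttle on the
caller side to at most one request per jiffy, a minimal sketch sitting
next to the existing rcu_request_urgent_qs_task() call could look roughly
like the below. Illustrative only, not part of the patch; the per-CPU
variable and helper name are made up for the example:

/* Hypothetical per-CPU timestamp; name made up for this sketch. */
static DEFINE_PER_CPU(unsigned long, bpf_ma_last_urgent_qs);

/* Hypothetical wrapper: ask RCU to hurry at most once per jiffy per CPU. */
static void bpf_ma_request_urgent_qs(void)
{
        /*
         * RCU's state machine normally only advances about once per
         * jiffy, so asking more often than that buys nothing.
         * The read/write pair is racy, but the worst case is one
         * extra request, which is fine for a heuristic.
         */
        if (time_after(jiffies, this_cpu_read(bpf_ma_last_urgent_qs))) {
                this_cpu_write(bpf_ma_last_urgent_qs, jiffies);
                rcu_request_urgent_qs_task(current);
        }
}

The per-CPU timestamp keeps the check lock-free, so it would add
essentially nothing to the fast path, but as above I'd prefer not to add
it until the contention actually shows up in practice.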