Re: [PATCH bpf-next 1/3] bpf: Free elements after one RCU-tasks-trace grace period

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Tue, 11 Oct 2022 23:36:07 -0700

On Tue, Oct 11, 2022 at 07:31:28PM +0800, Hou Tao wrote:
> Hi,
> 
> On 10/11/2022 5:07 PM, Paul E. McKenney wrote:
> > On Tue, Oct 11, 2022 at 03:11:26PM +0800, Hou Tao wrote:
> >> From: Hou Tao <houtao1@xxxxxxxxxx>
> >>
> >> According to the implementation of RCU Tasks Trace, it inovkes
> >> ->postscan_func() to wait for one RCU-tasks-trace grace period and
> >> rcu_tasks_trace_postscan() inovkes synchronize_rcu() to wait for one
> >> normal RCU grace period in turn, so one RCU-tasks-trace grace period
> >> will imply one RCU grace period.
> >>
> >> So there is no need to do call_rcu() again in the callback of
> >> call_rcu_tasks_trace() and it can just free these elements directly.
> > This is true, but this is an implementation detail that is not guaranteed
> > in future versions of the kernel.  But if this additional call_rcu()
> > is causing trouble, I could add some API member that returned true in
> > kernels where it does happen to be the case that call_rcu_tasks_trace()
> > implies a call_rcu()-style grace period.
> >
> > The BPF memory allocator could then complain or adapt, as appropriate.
> >
> > Thoughts?
> It is indeed an implementation details. But In an idle KVM guest, for memory
> reclamation in bpf memory allocator a RCU tasks trace grace period is about 30ms
> and RCU grace period is about 20 ms. Under stress condition, the grace period
> will be much longer. If the extra RCU grace period can be removed, these memory
> can be reclaimed more quickly and it will be beneficial for memory pressure.

I understand the benefits.  We just need to get a safe way to take
advantage of them.

> Now it seems we can use RCU poll APIs (e.g. get_state_synchronize_rcu() and
> poll_state_synchronize_rcu() ) to check whether or not a RCU grace period has
> passed. But It needs to add at least one unsigned long into the freeing object.
> The extra memory overhead may be OK for bpf memory allocator, but it is not for
> small object. So could you please show example on how these new APIs work ? Does
> it need to modify the to-be-free object ?

Good point on the polling APIs, more on this below.

I was thinking in terms of an API like this:

	static inline bool rcu_trace_implies_rcu_gp(void)
	{
		return true;
	}

Along with comments on the synchronize_rcu() pointing at the
rcu_trace_implies_rcu_gp().

Another approach is to wait for the grace periods concurrently, but this
requires not one but two rcu_head structures.

Back to the polling API.  Are these things freed individually, or can
they be grouped?  If they can be grouped, the storage for the grace-period
state could be associated with the group.

							Thanx, Paul

> >> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
> >> ---
> >>  kernel/bpf/memalloc.c | 17 ++++++-----------
> >>  1 file changed, 6 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> >> index 5f83be1d2018..6f32dddc804f 100644
> >> --- a/kernel/bpf/memalloc.c
> >> +++ b/kernel/bpf/memalloc.c
> >> @@ -209,6 +209,9 @@ static void free_one(struct bpf_mem_cache *c, void *obj)
> >>  	kfree(obj);
> >>  }
> >>  
> >> +/* Now RCU Tasks grace period implies RCU grace period, so no need to do
> >> + * an extra call_rcu().
> >> + */
> >>  static void __free_rcu(struct rcu_head *head)
> >>  {
> >>  	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
> >> @@ -220,13 +223,6 @@ static void __free_rcu(struct rcu_head *head)
> >>  	atomic_set(&c->call_rcu_in_progress, 0);
> >>  }
> >>  
> >> -static void __free_rcu_tasks_trace(struct rcu_head *head)
> >> -{
> >> -	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
> >> -
> >> -	call_rcu(&c->rcu, __free_rcu);
> >> -}
> >> -
> >>  static void enque_to_free(struct bpf_mem_cache *c, void *obj)
> >>  {
> >>  	struct llist_node *llnode = obj;
> >> @@ -252,11 +248,10 @@ static void do_call_rcu(struct bpf_mem_cache *c)
> >>  		 * from __free_rcu() and from drain_mem_cache().
> >>  		 */
> >>  		__llist_add(llnode, &c->waiting_for_gp);
> >> -	/* Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
> >> -	 * Then use call_rcu() to wait for normal progs to finish
> >> -	 * and finally do free_one() on each element.
> >> +	/* Use call_rcu_tasks_trace() to wait for both sleepable and normal
> >> +	 * progs to finish and finally do free_one() on each element.
> >>  	 */
> >> -	call_rcu_tasks_trace(&c->rcu, __free_rcu_tasks_trace);
> >> +	call_rcu_tasks_trace(&c->rcu, __free_rcu);
> >>  }
> >>  
> >>  static void free_bulk(struct bpf_mem_cache *c)
> >> -- 
> >> 2.29.2
> >>
>