Re: [PATCH bpf-next 1/3] bpf: Free elements after one RCU-tasks-trace grace period

On Thu, Oct 13, 2022 at 09:25:31AM +0800, Hou Tao wrote:
> Hi,
> 
> On 10/13/2022 12:11 AM, Paul E. McKenney wrote:
> > On Wed, Oct 12, 2022 at 05:26:26PM +0800, Hou Tao wrote:
> >> Hi,
> >>
> >> On 10/12/2022 2:36 PM, Paul E. McKenney wrote:
> >>> On Tue, Oct 11, 2022 at 07:31:28PM +0800, Hou Tao wrote:
> >>>> Hi,
> >>>>
> >>>> On 10/11/2022 5:07 PM, Paul E. McKenney wrote:
> >>>>> On Tue, Oct 11, 2022 at 03:11:26PM +0800, Hou Tao wrote:
> >>>>>> From: Hou Tao <houtao1@xxxxxxxxxx>
> >>>>>>
> >>>>>> According to the implementation of RCU Tasks Trace, it invokes
> >>>>>> ->postscan_func() to wait for one RCU-tasks-trace grace period, and
> >>>>>> rcu_tasks_trace_postscan() in turn invokes synchronize_rcu() to wait
> >>>>>> for one normal RCU grace period, so one RCU-tasks-trace grace period
> >>>>>> implies one RCU grace period.
> >>>>>>
> >>>>>> So there is no need to do call_rcu() again in the callback of
> >>>>>> call_rcu_tasks_trace(); it can just free these elements directly.
> >>>>> This is true, but this is an implementation detail that is not guaranteed
> >>>>> in future versions of the kernel.  But if this additional call_rcu()
> >>>>> is causing trouble, I could add some API member that returned true in
> >>>>> kernels where it does happen to be the case that call_rcu_tasks_trace()
> >>>>> implies a call_rcu()-style grace period.
> >>>>>
> >>>>> The BPF memory allocator could then complain or adapt, as appropriate.
> >>>>>
> >>>>> Thoughts?
> >>>> It is indeed an implementation detail. But in an idle KVM guest, for memory
> >>>> reclamation in the bpf memory allocator an RCU Tasks Trace grace period is
> >>>> about 30 ms and an RCU grace period is about 20 ms. Under stress, the grace
> >>>> periods will be much longer. If the extra RCU grace period can be removed,
> >>>> this memory can be reclaimed more quickly, which helps under memory pressure.
> >>> I understand the benefits.  We just need to get a safe way to take
> >>> advantage of them.
> >>>
> >>>> Now it seems we can use the RCU polling APIs (e.g. get_state_synchronize_rcu()
> >>>> and poll_state_synchronize_rcu()) to check whether an RCU grace period has
> >>>> elapsed. But that needs at least one extra unsigned long in each to-be-freed
> >>>> object. The extra memory overhead may be OK for the bpf memory allocator, but
> >>>> it is not for small objects. So could you please show an example of how these
> >>>> new APIs work? Does it require modifying the to-be-freed object?
> >>> Good point on the polling APIs, more on this below.
> >>>
> >>> I was thinking in terms of an API like this:
> >>>
> >>> 	static inline bool rcu_trace_implies_rcu_gp(void)
> >>> 	{
> >>> 		return true;
> >>> 	}
> >>>
> >>> Along with comments on the synchronize_rcu() pointing at the
> >>> rcu_trace_implies_rcu_gp().
> >> It is a simple API and the modifications for call_rcu_tasks_trace() users will
> >> also be simple. The callback of call_rcu_tasks_trace() will invoke
> >> rcu_trace_implies_rcu_gp(); if it returns true, the callback invokes the
> >> callback for call_rcu() directly, else it does so through call_rcu().
> > Sounds good!
> >
> > Please note that if the callback function just does kfree() or equivalent,
> > this will work fine.  If it acquires spinlocks, you may need to do
> > local_bh_disable() before invoking it directly and local_bh_enable()
> > afterwards.
> What is the purpose of invoking local_bh_disable() ? Is it to ensure the
> callback is called in soft-irq context, or something else ? As far as I know,
> Tasks RCU already invokes its callbacks with soft-irqs disabled.
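The direct-invocation pattern described above can be sketched in userspace C with the kernel primitives stubbed out. This is a model of the control flow only, not kernel code: everything prefixed `sim_` is a hypothetical stand-in, and `rcu_trace_implies_rcu_gp()` here mirrors the helper proposed later in this mail.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for struct rcu_head. */
struct rcu_head_sim { void (*func)(struct rcu_head_sim *rhp); };

/* The proposed helper; in the kernel it codifies the current accident. */
static bool rcu_trace_implies_rcu_gp(void) { return true; }

static int bh_disabled;                 /* models local_bh_disable() nesting */
static void sim_local_bh_disable(void) { bh_disabled++; }
static void sim_local_bh_enable(void)  { bh_disabled--; }

static int freed;

/* The final callback that would normally be passed to call_rcu(). */
static void free_cb(struct rcu_head_sim *rhp)
{
	freed++;
}

/* Stubbed call_rcu(): a real one defers; here we invoke immediately. */
static void sim_call_rcu(struct rcu_head_sim *rhp,
			 void (*func)(struct rcu_head_sim *))
{
	func(rhp);
}

/*
 * Callback passed to call_rcu_tasks_trace().  If a Tasks Trace grace
 * period implies an RCU grace period, invoke the RCU-stage callback
 * directly (with BH disabled, per Paul's caveat about callbacks that
 * acquire spinlocks); otherwise fall back to queueing via call_rcu().
 */
static void tasks_trace_cb(struct rcu_head_sim *rhp)
{
	if (rcu_trace_implies_rcu_gp()) {
		sim_local_bh_disable();
		free_cb(rhp);
		sim_local_bh_enable();
	} else {
		sim_call_rcu(rhp, free_cb);
	}
}
```

In the kernel the `else` branch would re-queue the element through a second grace period; the whole point of the helper is that today the `if` branch is always taken.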
> >
> >>> Another approach is to wait for the grace periods concurrently, but this
> >>> requires not one but two rcu_head structures.
> >> Beside the extra space usage, does it also complicate the logic in callback
> >> function because it needs to handle the concurrency problem ?
> > Definitely!!!
> >
> > Perhaps something like this:
> >
> > 	static void cbf(struct rcu_head *rhp)
> > 	{
> > 		p = container_of(rhp, ...);
> >
> > 		if (atomic_dec_and_test(&p->cbs_awaiting))
> > 			kfree(p);
> > 	}
> >
> > 	atomic_set(&p->cbs_awaiting, 2);
> > 	call_rcu(&p->rh1, cbf);
> > 	call_rcu_tasks_trace(&p->rh2, cbf);
> >
> > Is this worth it?  I have no idea.  I must defer to you.
> I still prefer the simple solution.
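For reference, Paul's two-rcu_head sketch above can be fleshed out as a compilable userspace model. The `sim_` names are stand-ins, and the two call_rcu variants are stubbed to invoke their callback immediately; in the kernel they would run after the respective grace periods, in either order. One detail the sketch glosses over: a single shared callback cannot `container_of()` both `rh1` and `rh2`, so two thin wrappers are used here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

struct rcu_head_sim { void (*func)(struct rcu_head_sim *rhp); };

struct obj {
	atomic_int cbs_awaiting;        /* grace-period callbacks outstanding */
	struct rcu_head_sim rh1;        /* for call_rcu() */
	struct rcu_head_sim rh2;        /* for call_rcu_tasks_trace() */
};

static int freed_objects;

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Free only when both grace-period callbacks have run. */
static void obj_put(struct obj *p)
{
	if (atomic_fetch_sub(&p->cbs_awaiting, 1) == 1) {
		free(p);
		freed_objects++;
	}
}

/* One thin wrapper per rcu_head so container_of() finds the object. */
static void cbf_rh1(struct rcu_head_sim *rhp)
{
	obj_put(container_of(rhp, struct obj, rh1));
}

static void cbf_rh2(struct rcu_head_sim *rhp)
{
	obj_put(container_of(rhp, struct obj, rh2));
}

/* Stub: a real call_rcu()/call_rcu_tasks_trace() defers the callback. */
static void sim_call_rcu(struct rcu_head_sim *rhp,
			 void (*func)(struct rcu_head_sim *))
{
	func(rhp);
}
```

The cost Hou Tao objects to is visible in the struct: two rcu_head structures plus the atomic counter per object, versus one rcu_head for the simple chained approach.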
> >
> >>> Back to the polling API.  Are these things freed individually, or can
> >>> they be grouped?  If they can be grouped, the storage for the grace-period
> >>> state could be associated with the group.
> >> As said above, for the bpf memory allocator it may be OK because it frees
> >> elements in batches, but bpf local storage and its elements are freed
> >> individually. So I think that if the implication of an RCU Tasks Trace grace
> >> period will not change in the foreseeable future, adding
> >> rcu_trace_implies_rcu_gp() and using it in bpf is a good idea. What do you think ?
> > Maybe the BPF memory allocator does it one way and BPF local storage
> > does it another way.
> Why not. Maybe the bpf experts will think the space overhead is also reasonable
> in the BPF local storage case.
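The batching idea from the polling-API discussion can also be modeled in userspace: one grace-period "cookie" is stored per batch rather than per object, which is why it suits the bpf memory allocator but not individually freed small objects. Everything here is a hypothetical `sim_` stand-in; the real kernel cookies come from get_state_synchronize_rcu() and are checked with poll_state_synchronize_rcu().

```c
#include <assert.h>
#include <stdbool.h>

static unsigned long sim_gp_seq;   /* completed simulated grace periods */

/* Snapshot a cookie naming the grace period that has not yet completed. */
static unsigned long sim_get_state(void)
{
	return sim_gp_seq + 1;
}

/* True once the grace period named by the cookie has elapsed. */
static bool sim_poll_state(unsigned long cookie)
{
	return sim_gp_seq >= cookie;
}

/* Model the passage of one grace period. */
static void sim_gp_elapse(void)
{
	sim_gp_seq++;
}

/*
 * A batch of to-be-freed objects shares one cookie, amortizing the
 * extra unsigned long that a per-object scheme would require.
 */
struct sim_batch {
	unsigned long gp_cookie;
	int nr_objects;
};

static bool batch_can_free(const struct sim_batch *b)
{
	return sim_poll_state(b->gp_cookie);
}
```

A freeing path would take the cookie when the batch is filled and poll it later, freeing the whole batch once the grace period has elapsed instead of queueing a callback per object.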
> >
> > How about if I produce a patch for rcu_trace_implies_rcu_gp() and let
> > you carry it with your series?  That way I don't have an unused function
> > in -rcu and you don't have to wait for me to send it upstream?
> Sounds reasonable to me. Also, thanks for your suggestions.

Here you go!  Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

commit 2eac2f7a9a6d8921e8084a6acdffa595e99dbd17
Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
Date:   Thu Oct 13 11:54:13 2022 -0700

    rcu-tasks: Provide rcu_trace_implies_rcu_gp()
    
    As an accident of implementation, an RCU Tasks Trace grace period also
    acts as an RCU grace period.  However, this could change at any time.
    This commit therefore creates an rcu_trace_implies_rcu_gp() that currently
    returns true to codify this accident.  Code relying on this accident
    must call this function to verify that this accident is still happening.
    
    Reported-by: Hou Tao <houtao@xxxxxxxxxxxxxxx>
    Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
    Cc: Alexei Starovoitov <ast@xxxxxxxxxx>
    Cc: Martin KaFai Lau <martin.lau@xxxxxxxxx>

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 08605ce7379d7..8822f06e4b40c 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -240,6 +240,18 @@ static inline void exit_tasks_rcu_start(void) { }
 static inline void exit_tasks_rcu_finish(void) { }
 #endif /* #else #ifdef CONFIG_TASKS_RCU_GENERIC */
 
+/**
+ * rcu_trace_implies_rcu_gp - does an RCU Tasks Trace grace period imply an RCU grace period?
+ *
+ * As an accident of implementation, an RCU Tasks Trace grace period also
+ * acts as an RCU grace period.  However, this could change at any time.
+ * Code relying on this accident must call this function to verify that
+ * this accident is still happening.
+ *
+ * You have been warned!
+ */
+static inline bool rcu_trace_implies_rcu_gp(void) { return true; }
+
 /**
  * cond_resched_tasks_rcu_qs - Report potential quiescent states to RCU
  *
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index b0b885e071fa8..fe9840d90e960 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -1535,6 +1535,8 @@ static void rcu_tasks_trace_postscan(struct list_head *hop)
 {
 	// Wait for late-stage exiting tasks to finish exiting.
 	// These might have passed the call to exit_tasks_rcu_finish().
+
+	// If you remove the following line, update rcu_trace_implies_rcu_gp()!!!
 	synchronize_rcu();
 	// Any tasks that exit after this point will set
 	// TRC_NEED_QS_CHECKED in ->trc_reader_special.b.need_qs.


