On Thu, Oct 13, 2022 at 09:25:31AM +0800, Hou Tao wrote:
> Hi,
>
> On 10/13/2022 12:11 AM, Paul E. McKenney wrote:
> > On Wed, Oct 12, 2022 at 05:26:26PM +0800, Hou Tao wrote:
> >> Hi,
> >>
> >> On 10/12/2022 2:36 PM, Paul E. McKenney wrote:
> >>> On Tue, Oct 11, 2022 at 07:31:28PM +0800, Hou Tao wrote:
> >>>> Hi,
> >>>>
> >>>> On 10/11/2022 5:07 PM, Paul E. McKenney wrote:
> >>>>> On Tue, Oct 11, 2022 at 03:11:26PM +0800, Hou Tao wrote:
> >>>>>> From: Hou Tao <houtao1@xxxxxxxxxx>
> >>>>>>
> >>>>>> According to the implementation of RCU Tasks Trace, it invokes
> >>>>>> ->postscan_func() to wait for one RCU-tasks-trace grace period, and
> >>>>>> rcu_tasks_trace_postscan() in turn invokes synchronize_rcu() to wait
> >>>>>> for one normal RCU grace period, so one RCU-tasks-trace grace period
> >>>>>> implies one RCU grace period.
> >>>>>>
> >>>>>> So there is no need to do call_rcu() again in the callback of
> >>>>>> call_rcu_tasks_trace(); it can just free these elements directly.
> >>>>> This is true, but this is an implementation detail that is not guaranteed
> >>>>> in future versions of the kernel.  But if this additional call_rcu()
> >>>>> is causing trouble, I could add some API member that returned true in
> >>>>> kernels where it does happen to be the case that call_rcu_tasks_trace()
> >>>>> implies a call_rcu()-style grace period.
> >>>>>
> >>>>> The BPF memory allocator could then complain or adapt, as appropriate.
> >>>>>
> >>>>> Thoughts?
> >>>> It is indeed an implementation detail.  But in an idle KVM guest, for
> >>>> memory reclamation in the BPF memory allocator an RCU Tasks Trace grace
> >>>> period is about 30 ms and an RCU grace period is about 20 ms.  Under
> >>>> stress, the grace periods will be much longer.  If the extra RCU grace
> >>>> period can be removed, this memory can be reclaimed more quickly, which
> >>>> is beneficial under memory pressure.
> >>> I understand the benefits.  We just need to get a safe way to take
> >>> advantage of them.
> >>>
> >>>> Now it seems we can use the RCU polling APIs (e.g. get_state_synchronize_rcu()
> >>>> and poll_state_synchronize_rcu()) to check whether or not an RCU grace
> >>>> period has passed.  But that needs to add at least one unsigned long into
> >>>> the object being freed.  The extra memory overhead may be OK for the BPF
> >>>> memory allocator, but it is not for small objects.  So could you please
> >>>> show an example of how these new APIs work?  Does it need to modify the
> >>>> to-be-freed object?
> >>> Good point on the polling APIs, more on this below.
> >>>
> >>> I was thinking in terms of an API like this:
> >>>
> >>>	static inline bool rcu_trace_implies_rcu_gp(void)
> >>>	{
> >>>		return true;
> >>>	}
> >>>
> >>> Along with comments on the synchronize_rcu() pointing at the
> >>> rcu_trace_implies_rcu_gp().
> >> It is a simple API and the modifications for call_rcu_tasks_trace() users
> >> will also be simple.  The callback of call_rcu_tasks_trace() will invoke
> >> rcu_trace_implies_rcu_gp(), and if it returns true, the callback invokes
> >> the call_rcu() callback directly; otherwise it does so through call_rcu().
> > Sounds good!
> >
> > Please note that if the callback function just does kfree() or equivalent,
> > this will work fine.  If it acquires spinlocks, you may need to do
> > local_bh_disable() before invoking it directly and local_bh_enable()
> > afterwards.
> What is the purpose of invoking local_bh_disable()?  Is it trying to ensure
> that the callback is called in softirq context, or something else?  For all
> I know, RCU Tasks Trace already invokes its callbacks with softirqs
> disabled.
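
In case a concrete example helps, the pattern being described might look
something like the following sketch.  The struct and function names are made
up for illustration, and this is untested:

	struct foo {
		struct rcu_head rcu;
		/* ... payload ... */
	};

	static void foo_free_rcu(struct rcu_head *rhp)
	{
		kfree(container_of(rhp, struct foo, rcu));
	}

	static void foo_free_tasks_trace_cb(struct rcu_head *rhp)
	{
		if (rcu_trace_implies_rcu_gp()) {
			/*
			 * The Tasks Trace grace period also implied a
			 * normal RCU grace period, so free immediately.
			 * Disable BH in case the freeing path acquires
			 * locks that are elsewhere taken in softirq
			 * context.
			 */
			local_bh_disable();
			foo_free_rcu(rhp);
			local_bh_enable();
		} else {
			/* No implied RCU grace period, so chain one. */
			call_rcu(rhp, foo_free_rcu);
		}
	}

Callers would then free an object with:

	call_rcu_tasks_trace(&fp->rcu, foo_free_tasks_trace_cb);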
> >
> >>> Another approach is to wait for the grace periods concurrently, but this
> >>> requires not one but two rcu_head structures.
> >> Besides the extra space usage, does it also complicate the logic in the
> >> callback function, because it needs to handle the concurrency problem?
> > Definitely!!!
> >
> > Perhaps something like this:
> >
> >	static void cbf(struct rcu_head *rhp)
> >	{
> >		p = container_of(rhp, ...);
> >
> >		/* Free only after both grace periods have elapsed. */
> >		if (atomic_dec_and_test(&p->cbs_awaiting))
> >			kfree(p);
> >	}
> >
> >	atomic_set(&p->cbs_awaiting, 2);
> >	call_rcu(&p->rh1, cbf);
> >	call_rcu_tasks_trace(&p->rh2, cbf);
> >
> > Is this worth it?  I have no idea.  I must defer to you.
> I still prefer the simple solution.
> >
> >>> Back to the polling API.  Are these things freed individually, or can
> >>> they be grouped?  If they can be grouped, the storage for the grace-period
> >>> state could be associated with the group.
> >> As said above, for the BPF memory allocator it may be OK because it frees
> >> elements in batches, but BPF local storage and its elements are freed
> >> individually.  So I think that if the implication of an RCU Tasks Trace
> >> grace period will not change in the foreseeable future, adding
> >> rcu_trace_implies_rcu_gp() and using it in BPF is a good idea.  What do
> >> you think?
> > Maybe the BPF memory allocator does it one way and BPF local storage
> > does it another way.
> Why not.  Maybe the BPF experts think the space overhead is also reasonable
> in the BPF local storage case.
> >
> > How about if I produce a patch for rcu_trace_implies_rcu_gp() and let
> > you carry it with your series?  That way I don't have an unused function
> > in -rcu and you don't have to wait for me to send it upstream?
> Sounds reasonable to me.  Also thanks for your suggestions.

Here you go!  Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

commit 2eac2f7a9a6d8921e8084a6acdffa595e99dbd17
Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
Date:   Thu Oct 13 11:54:13 2022 -0700

    rcu-tasks: Provide rcu_trace_implies_rcu_gp()

    As an accident of implementation, an RCU Tasks Trace grace period also
    acts as an RCU grace period.  However, this could change at any time.
    This commit therefore creates an rcu_trace_implies_rcu_gp() that
    currently returns true to codify this accident.  Code relying on this
    accident must call this function to verify that this accident is still
    happening.

    Reported-by: Hou Tao <houtao@xxxxxxxxxxxxxxx>
    Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
    Cc: Alexei Starovoitov <ast@xxxxxxxxxx>
    Cc: Martin KaFai Lau <martin.lau@xxxxxxxxx>

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 08605ce7379d7..8822f06e4b40c 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -240,6 +240,18 @@ static inline void exit_tasks_rcu_start(void) { }
 static inline void exit_tasks_rcu_finish(void) { }
 #endif /* #else #ifdef CONFIG_TASKS_RCU_GENERIC */
 
+/**
+ * rcu_trace_implies_rcu_gp - does an RCU Tasks Trace grace period imply an RCU grace period?
+ *
+ * As an accident of implementation, an RCU Tasks Trace grace period also
+ * acts as an RCU grace period.  However, this could change at any time.
+ * Code relying on this accident must call this function to verify that
+ * this accident is still happening.
+ *
+ * You have been warned!
+ */
+static inline bool rcu_trace_implies_rcu_gp(void) { return true; }
+
 /**
  * cond_resched_tasks_rcu_qs - Report potential quiescent states to RCU
  *
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index b0b885e071fa8..fe9840d90e960 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -1535,6 +1535,8 @@ static void rcu_tasks_trace_postscan(struct list_head *hop)
 {
 	// Wait for late-stage exiting tasks to finish exiting.
 	// These might have passed the call to exit_tasks_rcu_finish().
+
+	// If you remove the following line, update rcu_trace_implies_rcu_gp()!!!
 	synchronize_rcu();
 	// Any tasks that exit after this point will set
 	// TRC_NEED_QS_CHECKED in ->trc_reader_special.b.need_qs.
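
And in case it is useful, here is one way the grouped polling-API approach
mentioned above ("the storage for the grace-period state could be associated
with the group") might look.  The names are made up for illustration and this
is untested; one cookie and one rcu_head cover an entire batch of objects, so
the per-object space overhead goes away:

	struct obj {
		struct obj *next;
		/* ... payload ... */
	};

	struct free_batch {
		struct rcu_head rcu;
		struct obj *head;		/* objects awaiting grace periods */
		unsigned long rcu_state;	/* from get_state_synchronize_rcu() */
	};

	static void batch_free_now(struct free_batch *b)
	{
		struct obj *o, *next;

		for (o = b->head; o; o = next) {
			next = o->next;
			kfree(o);
		}
		kfree(b);
	}

	static void batch_free_rcu_cb(struct rcu_head *rhp)
	{
		batch_free_now(container_of(rhp, struct free_batch, rcu));
	}

	static void batch_free_tasks_trace_cb(struct rcu_head *rhp)
	{
		struct free_batch *b = container_of(rhp, struct free_batch, rcu);

		/*
		 * The cookie was recorded just before call_rcu_tasks_trace().
		 * If a normal RCU grace period has also elapsed since then,
		 * free immediately; otherwise chain an ordinary call_rcu().
		 */
		if (poll_state_synchronize_rcu(b->rcu_state))
			batch_free_now(b);
		else
			call_rcu(&b->rcu, batch_free_rcu_cb);
	}

The queueing side would record the cookie and then hand off the batch:

	b->rcu_state = get_state_synchronize_rcu();
	call_rcu_tasks_trace(&b->rcu, batch_free_tasks_trace_cb);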