Hi,

On 10/13/2022 12:11 AM, Paul E. McKenney wrote:
> On Wed, Oct 12, 2022 at 05:26:26PM +0800, Hou Tao wrote:
>> Hi,
>>
>> On 10/12/2022 2:36 PM, Paul E. McKenney wrote:
>>> On Tue, Oct 11, 2022 at 07:31:28PM +0800, Hou Tao wrote:
>>>> Hi,
>>>>
>>>> On 10/11/2022 5:07 PM, Paul E. McKenney wrote:
>>>>> On Tue, Oct 11, 2022 at 03:11:26PM +0800, Hou Tao wrote:
>>>>>> From: Hou Tao <houtao1@xxxxxxxxxx>
>>>>>>
>>>>>> According to the implementation of RCU Tasks Trace, it invokes
>>>>>> ->postscan_func() to wait for one RCU-tasks-trace grace period, and
>>>>>> rcu_tasks_trace_postscan() invokes synchronize_rcu() to wait for one
>>>>>> normal RCU grace period in turn, so one RCU-tasks-trace grace period
>>>>>> will imply one RCU grace period.
>>>>>>
>>>>>> So there is no need to do call_rcu() again in the callback of
>>>>>> call_rcu_tasks_trace() and it can just free these elements directly.
>>>>> This is true, but this is an implementation detail that is not guaranteed
>>>>> in future versions of the kernel. But if this additional call_rcu()
>>>>> is causing trouble, I could add some API member that returned true in
>>>>> kernels where it does happen to be the case that call_rcu_tasks_trace()
>>>>> implies a call_rcu()-style grace period.
>>>>>
>>>>> The BPF memory allocator could then complain or adapt, as appropriate.
>>>>>
>>>>> Thoughts?
>>>> It is indeed an implementation detail. But in an idle KVM guest, for memory
>>>> reclamation in the bpf memory allocator an RCU-tasks-trace grace period is
>>>> about 30 ms and an RCU grace period is about 20 ms. Under stress, the grace
>>>> periods will be much longer. If the extra RCU grace period can be removed,
>>>> the memory can be reclaimed more quickly, which helps under memory pressure.
>>> I understand the benefits. We just need to get a safe way to take
>>> advantage of them.
>>>
>>>> Now it seems we can use the RCU polling APIs (e.g. get_state_synchronize_rcu()
>>>> and poll_state_synchronize_rcu()) to check whether or not an RCU grace period
>>>> has passed. But it needs to add at least one unsigned long into the object
>>>> being freed. The extra memory overhead may be OK for the bpf memory allocator,
>>>> but it is not for small objects. So could you please show an example of how
>>>> these new APIs work? Does it need to modify the to-be-freed object?
>>> Good point on the polling APIs, more on this below.
>>>
>>> I was thinking in terms of an API like this:
>>>
>>>	static inline bool rcu_trace_implies_rcu_gp(void)
>>>	{
>>>		return true;
>>>	}
>>>
>>> Along with comments on the synchronize_rcu() pointing at the
>>> rcu_trace_implies_rcu_gp().
>> It is a simple API and the modifications for call_rcu_tasks_trace() users will
>> also be simple. The callback of call_rcu_tasks_trace() will invoke
>> rcu_trace_implies_rcu_gp(); if it returns true, the callback invokes the
>> callback for call_rcu() directly, else it does so through call_rcu().
> Sounds good!
>
> Please note that if the callback function just does kfree() or equivalent,
> this will work fine. If it acquires spinlocks, you may need to do
> local_bh_disable() before invoking it directly and local_bh_enable()
> afterwards.

What is the purpose of invoking local_bh_disable()? Is it to ensure the
callback is invoked in soft-irq context, or something else? As far as I know,
tasks RCU already invokes its callbacks with soft-irq disabled.
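
Just to confirm that I understand the intended usage, the change in the bpf
memory allocator would look roughly like the sketch below (untested, and it
assumes the proposed rcu_trace_implies_rcu_gp() helper above; as you noted, the
direct call may additionally need a local_bh_disable()/local_bh_enable() pair
if the callback ever takes spinlocks):

	static void __free_rcu_tasks_trace(struct rcu_head *head)
	{
		/* If an RCU-tasks-trace grace period also implies an RCU
		 * grace period, there is no need to chain an extra
		 * call_rcu(), so free the elements directly.
		 */
		if (rcu_trace_implies_rcu_gp())
			__free_rcu(head);
		else
			call_rcu(head, __free_rcu);
	}

and do_call_rcu() would keep passing __free_rcu_tasks_trace to
call_rcu_tasks_trace(), as it does today.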

>
>>> Another approach is to wait for the grace periods concurrently, but this
>>> requires not one but two rcu_head structures.
>> Besides the extra space usage, does it also complicate the logic in the
>> callback function, because it needs to handle the concurrency problem?
> Definitely!!!
>
> Perhaps something like this:
>
>	static void cbf(struct rcu_head *rhp)
>	{
>		p = container_of(rhp, ...);
>
>		if (atomic_dec_and_test(&p->cbs_awaiting))
>			kfree(p);
>	}
>
>	atomic_set(&p->cbs_awaiting, 2);
>	call_rcu(&p->rh1, cbf);
>	call_rcu_tasks_trace(&p->rh2, cbf);
>
> Is this worth it? I have no idea. I must defer to you.

I still prefer the simple solution.

>
>>> Back to the polling API. Are these things freed individually, or can
>>> they be grouped? If they can be grouped, the storage for the grace-period
>>> state could be associated with the group.
>> As said above, for the bpf memory allocator it may be OK because it frees
>> elements in batches, but bpf local storage and its elements are freed
>> individually. So I think that if the implication of the RCU-tasks-trace
>> grace period will not change in the foreseeable future, adding
>> rcu_trace_implies_rcu_gp() and using it in bpf is a good idea. What do you
>> think?
> Maybe the BPF memory allocator does it one way and BPF local storage
> does it another way. Why not.

Maybe the BPF experts think the space overhead is also reasonable in the BPF
local storage case.
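
For the batched case in the bpf memory allocator, my understanding is that the
polling APIs could be used roughly as in the sketch below (untested, just to
check my understanding; rcu_gp_cookie would be a new unsigned long field added
to struct bpf_mem_cache):

	/* in do_call_rcu(), before starting the tasks-trace grace period */
	c->rcu_gp_cookie = get_state_synchronize_rcu();
	call_rcu_tasks_trace(&c->rcu, __free_rcu_tasks_trace);

	/* polling-based variant of the tasks-trace callback */
	static void __free_rcu_tasks_trace(struct rcu_head *head)
	{
		struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);

		/* Free directly if a normal RCU grace period has already
		 * elapsed since the cookie was recorded, else chain one
		 * more call_rcu().
		 */
		if (poll_state_synchronize_rcu(c->rcu_gp_cookie))
			__free_rcu(head);
		else
			call_rcu(head, __free_rcu);
	}

But as said above, this does not help bpf local storage, where objects are
freed individually, so I would still prefer rcu_trace_implies_rcu_gp().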

>
> How about if I produce a patch for rcu_trace_implies_rcu_gp() and let
> you carry it with your series? That way I don't have an unused function
> in -rcu and you don't have to wait for me to send it upstream?

Sounds reasonable to me. Also thanks for your suggestions.

>
>							Thanx, Paul
>
>>>>>> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
>>>>>> ---
>>>>>>  kernel/bpf/memalloc.c | 17 ++++++-----------
>>>>>>  1 file changed, 6 insertions(+), 11 deletions(-)
>>>>>>
>>>>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>>>>> index 5f83be1d2018..6f32dddc804f 100644
>>>>>> --- a/kernel/bpf/memalloc.c
>>>>>> +++ b/kernel/bpf/memalloc.c
>>>>>> @@ -209,6 +209,9 @@ static void free_one(struct bpf_mem_cache *c, void *obj)
>>>>>>  	kfree(obj);
>>>>>>  }
>>>>>>  
>>>>>> +/* Now RCU Tasks grace period implies RCU grace period, so no need to do
>>>>>> + * an extra call_rcu().
>>>>>> + */
>>>>>>  static void __free_rcu(struct rcu_head *head)
>>>>>>  {
>>>>>>  	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
>>>>>> @@ -220,13 +223,6 @@ static void __free_rcu(struct rcu_head *head)
>>>>>>  	atomic_set(&c->call_rcu_in_progress, 0);
>>>>>>  }
>>>>>>  
>>>>>> -static void __free_rcu_tasks_trace(struct rcu_head *head)
>>>>>> -{
>>>>>> -	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
>>>>>> -
>>>>>> -	call_rcu(&c->rcu, __free_rcu);
>>>>>> -}
>>>>>> -
>>>>>>  static void enque_to_free(struct bpf_mem_cache *c, void *obj)
>>>>>>  {
>>>>>>  	struct llist_node *llnode = obj;
>>>>>> @@ -252,11 +248,10 @@ static void do_call_rcu(struct bpf_mem_cache *c)
>>>>>>  	 * from __free_rcu() and from drain_mem_cache().
>>>>>>  	 */
>>>>>>  	__llist_add(llnode, &c->waiting_for_gp);
>>>>>> -	/* Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
>>>>>> -	 * Then use call_rcu() to wait for normal progs to finish
>>>>>> -	 * and finally do free_one() on each element.
>>>>>> +	/* Use call_rcu_tasks_trace() to wait for both sleepable and normal
>>>>>> +	 * progs to finish and finally do free_one() on each element.
>>>>>>  	 */
>>>>>> -	call_rcu_tasks_trace(&c->rcu, __free_rcu_tasks_trace);
>>>>>> +	call_rcu_tasks_trace(&c->rcu, __free_rcu);
>>>>>>  }
>>>>>>  
>>>>>>  static void free_bulk(struct bpf_mem_cache *c)
>>>>>> --
>>>>>> 2.29.2
>>>>>>