On Tue, Oct 11, 2022 at 07:31:28PM +0800, Hou Tao wrote: > Hi, > > On 10/11/2022 5:07 PM, Paul E. McKenney wrote: > > On Tue, Oct 11, 2022 at 03:11:26PM +0800, Hou Tao wrote: > >> From: Hou Tao <houtao1@xxxxxxxxxx> > >> > >> According to the implementation of RCU Tasks Trace, it inovkes > >> ->postscan_func() to wait for one RCU-tasks-trace grace period and > >> rcu_tasks_trace_postscan() inovkes synchronize_rcu() to wait for one > >> normal RCU grace period in turn, so one RCU-tasks-trace grace period > >> will imply one RCU grace period. > >> > >> So there is no need to do call_rcu() again in the callback of > >> call_rcu_tasks_trace() and it can just free these elements directly. > > This is true, but this is an implementation detail that is not guaranteed > > in future versions of the kernel. But if this additional call_rcu() > > is causing trouble, I could add some API member that returned true in > > kernels where it does happen to be the case that call_rcu_tasks_trace() > > implies a call_rcu()-style grace period. > > > > The BPF memory allocator could then complain or adapt, as appropriate. > > > > Thoughts? > It is indeed an implementation details. But In an idle KVM guest, for memory > reclamation in bpf memory allocator a RCU tasks trace grace period is about 30ms > and RCU grace period is about 20 ms. Under stress condition, the grace period > will be much longer. If the extra RCU grace period can be removed, these memory > can be reclaimed more quickly and it will be beneficial for memory pressure. I understand the benefits. We just need to get a safe way to take advantage of them. > Now it seems we can use RCU poll APIs (e.g. get_state_synchronize_rcu() and > poll_state_synchronize_rcu() ) to check whether or not a RCU grace period has > passed. But It needs to add at least one unsigned long into the freeing object. > The extra memory overhead may be OK for bpf memory allocator, but it is not for > small object. So could you please show example on how these new APIs work ? Does > it need to modify the to-be-free object ? Good point on the polling APIs, more on this below. I was thinking in terms of an API like this: static inline bool rcu_trace_implies_rcu_gp(void) { return true; } Along with comments on the synchronize_rcu() pointing at the rcu_trace_implies_rcu_gp(). Another approach is to wait for the grace periods concurrently, but this requires not one but two rcu_head structures. Back to the polling API. Are these things freed individually, or can they be grouped? If they can be grouped, the storage for the grace-period state could be associated with the group. Thanx, Paul > >> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx> > >> --- > >> kernel/bpf/memalloc.c | 17 ++++++----------- > >> 1 file changed, 6 insertions(+), 11 deletions(-) > >> > >> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c > >> index 5f83be1d2018..6f32dddc804f 100644 > >> --- a/kernel/bpf/memalloc.c > >> +++ b/kernel/bpf/memalloc.c > >> @@ -209,6 +209,9 @@ static void free_one(struct bpf_mem_cache *c, void *obj) > >> kfree(obj); > >> } > >> > >> +/* Now RCU Tasks grace period implies RCU grace period, so no need to do > >> + * an extra call_rcu(). > >> + */ > >> static void __free_rcu(struct rcu_head *head) > >> { > >> struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu); > >> @@ -220,13 +223,6 @@ static void __free_rcu(struct rcu_head *head) > >> atomic_set(&c->call_rcu_in_progress, 0); > >> } > >> > >> -static void __free_rcu_tasks_trace(struct rcu_head *head) > >> -{ > >> - struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu); > >> - > >> - call_rcu(&c->rcu, __free_rcu); > >> -} > >> - > >> static void enque_to_free(struct bpf_mem_cache *c, void *obj) > >> { > >> struct llist_node *llnode = obj; > >> @@ -252,11 +248,10 @@ static void do_call_rcu(struct bpf_mem_cache *c) > >> * from __free_rcu() and from drain_mem_cache(). > >> */ > >> __llist_add(llnode, &c->waiting_for_gp); > >> - /* Use call_rcu_tasks_trace() to wait for sleepable progs to finish. > >> - * Then use call_rcu() to wait for normal progs to finish > >> - * and finally do free_one() on each element. > >> + /* Use call_rcu_tasks_trace() to wait for both sleepable and normal > >> + * progs to finish and finally do free_one() on each element. > >> */ > >> - call_rcu_tasks_trace(&c->rcu, __free_rcu_tasks_trace); > >> + call_rcu_tasks_trace(&c->rcu, __free_rcu); > >> } > >> > >> static void free_bulk(struct bpf_mem_cache *c) > >> -- > >> 2.29.2 > >> >