Hi,

On 10/12/2022 2:36 PM, Paul E. McKenney wrote:
> On Tue, Oct 11, 2022 at 07:31:28PM +0800, Hou Tao wrote:
>> Hi,
>>
>> On 10/11/2022 5:07 PM, Paul E. McKenney wrote:
>>> On Tue, Oct 11, 2022 at 03:11:26PM +0800, Hou Tao wrote:
>>>> From: Hou Tao <houtao1@xxxxxxxxxx>
>>>>
>>>> According to the implementation of RCU Tasks Trace, it invokes
>>>> ->postscan_func() to wait for one RCU-tasks-trace grace period, and
>>>> rcu_tasks_trace_postscan() invokes synchronize_rcu() to wait for one
>>>> normal RCU grace period in turn, so one RCU-tasks-trace grace period
>>>> implies one RCU grace period.
>>>>
>>>> So there is no need to do call_rcu() again in the callback of
>>>> call_rcu_tasks_trace() and it can just free these elements directly.
>>> This is true, but this is an implementation detail that is not guaranteed
>>> in future versions of the kernel.  But if this additional call_rcu()
>>> is causing trouble, I could add some API member that returned true in
>>> kernels where it does happen to be the case that call_rcu_tasks_trace()
>>> implies a call_rcu()-style grace period.
>>>
>>> The BPF memory allocator could then complain or adapt, as appropriate.
>>>
>>> Thoughts?
>> It is indeed an implementation detail. But in an idle KVM guest, an
>> RCU-tasks-trace grace period for memory reclamation in the bpf memory
>> allocator is about 30 ms and an RCU grace period is about 20 ms. Under
>> stress, the grace periods will be much longer. If the extra RCU grace
>> period can be removed, the memory can be reclaimed more quickly, which
>> helps under memory pressure.
> I understand the benefits.  We just need to get a safe way to take
> advantage of them.
>
>> Now it seems we can use the RCU polling APIs (e.g.
>> get_state_synchronize_rcu() and poll_state_synchronize_rcu()) to check
>> whether or not an RCU grace period has passed. But they need at least
>> one extra unsigned long in the object being freed. The extra memory
>> overhead may be OK for the bpf memory allocator, but it is not for
>> small objects. So could you please show an example of how these new
>> APIs work? Do they need to modify the to-be-freed object?
> Good point on the polling APIs, more on this below.
>
> I was thinking in terms of an API like this:
>
>	static inline bool rcu_trace_implies_rcu_gp(void)
>	{
>		return true;
>	}
>
> Along with comments on the synchronize_rcu() pointing at the
> rcu_trace_implies_rcu_gp().

It is a simple API, and the modifications for call_rcu_tasks_trace() users
will also be simple: the callback of call_rcu_tasks_trace() invokes
rcu_trace_implies_rcu_gp(), and if it returns true, the callback invokes
the callback for call_rcu() directly, else it does so through call_rcu().
(A rough sketch follows below.)

> Another approach is to wait for the grace periods concurrently, but this
> requires not one but two rcu_head structures.

Besides the extra space usage, does it also complicate the logic in the
callback function, because it needs to handle the concurrency between the
two grace periods?

> Back to the polling API.  Are these things freed individually, or can
> they be grouped?  If they can be grouped, the storage for the grace-period
> state could be associated with the group.

As said above, for the bpf memory allocator it may be OK because it frees
elements in batches (a sketch of the grouped polling approach is also
below), but for bpf local storage and its elements the memory is freed
individually. So I think that if the implication of the RCU-tasks-trace
grace period will not change in the foreseeable future, adding
rcu_trace_implies_rcu_gp() and using it in bpf is a good idea.

What do you think?
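To make that concrete, here is a rough sketch of how the bpf memory
allocator callback could adapt, assuming rcu_trace_implies_rcu_gp() is
added as you describe (the helper below is the existing
__free_rcu_tasks_trace() reworked, rather than removed as in this patch):

	static void __free_rcu_tasks_trace(struct rcu_head *head)
	{
		/* If an RCU-tasks-trace grace period implies an RCU grace
		 * period, there is no need to invoke call_rcu() again.
		 */
		if (rcu_trace_implies_rcu_gp())
			__free_rcu(head);
		else
			call_rcu(head, __free_rcu);
	}

That way do_call_rcu() keeps passing __free_rcu_tasks_trace() to
call_rcu_tasks_trace(), and the extra call_rcu() is only paid on kernels
where the implication does not hold.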
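And for the grouped case you mention, a minimal sketch of the polling
approach, storing one grace-period cookie per batch instead of one per
object (the struct and helper names here are hypothetical; only
get_state_synchronize_rcu() and poll_state_synchronize_rcu() are real
APIs):

	/* Hypothetical per-batch state: one cookie covers every element
	 * in the batch, so small objects need no extra field of their own.
	 */
	struct free_batch {
		struct llist_head objs;		/* elements waiting to be freed */
		unsigned long rcu_state;	/* grace-period cookie */
	};

	static void batch_record_gp(struct free_batch *b)
	{
		/* Snapshot the grace-period state for the whole batch. */
		b->rcu_state = get_state_synchronize_rcu();
	}

	static bool batch_ready_to_free(struct free_batch *b)
	{
		/* True once a full RCU grace period has elapsed since
		 * the snapshot, so every element may then be freed.
		 */
		return poll_state_synchronize_rcu(b->rcu_state);
	}

But as noted above, this only helps where frees can actually be batched.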
>
>							Thanx, Paul
>
>>>> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
>>>> ---
>>>>  kernel/bpf/memalloc.c | 17 ++++++-----------
>>>>  1 file changed, 6 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>>> index 5f83be1d2018..6f32dddc804f 100644
>>>> --- a/kernel/bpf/memalloc.c
>>>> +++ b/kernel/bpf/memalloc.c
>>>> @@ -209,6 +209,9 @@ static void free_one(struct bpf_mem_cache *c, void *obj)
>>>>  	kfree(obj);
>>>>  }
>>>>  
>>>> +/* Now RCU Tasks grace period implies RCU grace period, so no need to do
>>>> + * an extra call_rcu().
>>>> + */
>>>>  static void __free_rcu(struct rcu_head *head)
>>>>  {
>>>>  	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
>>>> @@ -220,13 +223,6 @@ static void __free_rcu(struct rcu_head *head)
>>>>  	atomic_set(&c->call_rcu_in_progress, 0);
>>>>  }
>>>>  
>>>> -static void __free_rcu_tasks_trace(struct rcu_head *head)
>>>> -{
>>>> -	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
>>>> -
>>>> -	call_rcu(&c->rcu, __free_rcu);
>>>> -}
>>>> -
>>>>  static void enque_to_free(struct bpf_mem_cache *c, void *obj)
>>>>  {
>>>>  	struct llist_node *llnode = obj;
>>>> @@ -252,11 +248,10 @@ static void do_call_rcu(struct bpf_mem_cache *c)
>>>>  	 * from __free_rcu() and from drain_mem_cache().
>>>>  	 */
>>>>  	__llist_add(llnode, &c->waiting_for_gp);
>>>> -	/* Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
>>>> -	 * Then use call_rcu() to wait for normal progs to finish
>>>> -	 * and finally do free_one() on each element.
>>>> +	/* Use call_rcu_tasks_trace() to wait for both sleepable and normal
>>>> +	 * progs to finish and finally do free_one() on each element.
>>>>  	 */
>>>> -	call_rcu_tasks_trace(&c->rcu, __free_rcu_tasks_trace);
>>>> +	call_rcu_tasks_trace(&c->rcu, __free_rcu);
>>>>  }
>>>>  
>>>>  static void free_bulk(struct bpf_mem_cache *c)
>>>> --
>>>> 2.29.2
>>>>