Hi,

On 10/13/2022 12:11 AM, Paul E. McKenney wrote:
> On Wed, Oct 12, 2022 at 05:26:26PM +0800, Hou Tao wrote:
>> Hi,
>>
>> On 10/12/2022 2:36 PM, Paul E. McKenney wrote:
>>> On Tue, Oct 11, 2022 at 07:31:28PM +0800, Hou Tao wrote:
>>>> Hi,
>>>>
>>>> On 10/11/2022 5:07 PM, Paul E. McKenney wrote:
>>>>> On Tue, Oct 11, 2022 at 03:11:26PM +0800, Hou Tao wrote:
>>>>>> From: Hou Tao <houtao1@xxxxxxxxxx>
>>>>>>
>>>>>> According to the implementation of RCU Tasks Trace, it invokes
>>>>>> ->postscan_func() to wait for one RCU-tasks-trace grace period, and
>>>>>> rcu_tasks_trace_postscan() invokes synchronize_rcu() to wait for one
>>>>>> normal RCU grace period in turn, so one RCU-tasks-trace grace period
>>>>>> will imply one RCU grace period.
>>>>>>
>>>>>> So there is no need to do call_rcu() again in the callback of
>>>>>> call_rcu_tasks_trace() and it can just free these elements directly.
>>>>> This is true, but this is an implementation detail that is not guaranteed
>>>>> in future versions of the kernel. But if this additional call_rcu()
>>>>> is causing trouble, I could add some API member that returned true in
>>>>> kernels where it does happen to be the case that call_rcu_tasks_trace()
>>>>> implies a call_rcu()-style grace period.
>>>>>
>>>>> The BPF memory allocator could then complain or adapt, as appropriate.
>>>>>
>>>>> Thoughts?
>>>> It is indeed an implementation detail. But in an idle KVM guest, for memory
>>>> reclamation in the bpf memory allocator an RCU-tasks-trace grace period is
>>>> about 30 ms and an RCU grace period is about 20 ms. Under stress, the grace
>>>> periods will be much longer. If the extra RCU grace period can be removed,
>>>> the memory can be reclaimed more quickly, which helps under memory pressure.
>>> I understand the benefits. We just need to get a safe way to take
>>> advantage of them.
>>>
>>>> Now it seems we can use the RCU polling APIs (e.g. get_state_synchronize_rcu()
>>>> and poll_state_synchronize_rcu()) to check whether or not an RCU grace period
>>>> has passed. But it needs to add at least one unsigned long into the object
>>>> being freed. The extra memory overhead may be OK for the bpf memory allocator,
>>>> but it is not for small objects. So could you please show an example of how
>>>> these new APIs work? Does it need to modify the to-be-freed object?
>>> Good point on the polling APIs, more on this below.
>>>
>>> I was thinking in terms of an API like this:
>>>
>>>	static inline bool rcu_trace_implies_rcu_gp(void)
>>>	{
>>>		return true;
>>>	}
>>>
>>> Along with comments on the synchronize_rcu() pointing at the
>>> rcu_trace_implies_rcu_gp().
>> It is a simple API and the modifications for call_rcu_tasks_trace() users will
>> also be simple. The callback of call_rcu_tasks_trace() will invoke
>> rcu_trace_implies_rcu_gp(); if it returns true, the callback invokes the
>> callback for call_rcu() directly, else it does so through call_rcu().
> Sounds good!
>
> Please note that if the callback function just does kfree() or equivalent,
> this will work fine. If it acquires spinlocks, you may need to do
> local_bh_disable() before invoking it directly and local_bh_enable()
> afterwards.

What is the purpose of invoking local_bh_disable()? Is it to ensure the
callback is invoked in soft-irq context, or something else? As far as I know,
tasks RCU already invokes its callbacks with soft-irq disabled.
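
Just to confirm that I understand the intended usage, the change in the bpf
memory allocator would look roughly like the sketch below (untested, and it
assumes the proposed rcu_trace_implies_rcu_gp() helper above; as you noted, the
direct call may additionally need a local_bh_disable()/local_bh_enable() pair
if the callback ever takes spinlocks):

	static void __free_rcu_tasks_trace(struct rcu_head *head)
	{
		/* If an RCU-tasks-trace grace period also implies an RCU
		 * grace period, there is no need to chain an extra
		 * call_rcu(), so free the elements directly.
		 */
		if (rcu_trace_implies_rcu_gp())
			__free_rcu(head);
		else
			call_rcu(head, __free_rcu);
	}

and do_call_rcu() would keep passing __free_rcu_tasks_trace to
call_rcu_tasks_trace(), as it does today.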

>
>>> Another approach is to wait for the grace periods concurrently, but this
>>> requires not one but two rcu_head structures.
>> Besides the extra space usage, does it also complicate the logic in the
>> callback function, because it needs to handle the concurrency problem?
> Definitely!!!
>
> Perhaps something like this:
>
>	static void cbf(struct rcu_head *rhp)
>	{
>		p = container_of(rhp, ...);
>
>		if (atomic_dec_and_test(&p->cbs_awaiting))
>			kfree(p);
>	}
>
>	atomic_set(&p->cbs_awaiting, 2);
>	call_rcu(&p->rh1, cbf);
>	call_rcu_tasks_trace(&p->rh2, cbf);
>
> Is this worth it? I have no idea. I must defer to you.

I still prefer the simple solution.

>
>>> Back to the polling API. Are these things freed individually, or can
>>> they be grouped? If they can be grouped, the storage for the grace-period
>>> state could be associated with the group.
>> As said above, for the bpf memory allocator it may be OK because it frees
>> elements in batches, but bpf local storage and its elements are freed
>> individually. So I think that if the implication of the RCU-tasks-trace
>> grace period will not change in the foreseeable future, adding
>> rcu_trace_implies_rcu_gp() and using it in bpf is a good idea. What do you
>> think?
> Maybe the BPF memory allocator does it one way and BPF local storage
> does it another way. Why not.

Maybe the BPF experts think the space overhead is also reasonable in the BPF
local storage case.
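
For the batched case in the bpf memory allocator, my understanding is that the
polling APIs could be used roughly as in the sketch below (untested, just to
check my understanding; rcu_gp_cookie would be a new unsigned long field added
to struct bpf_mem_cache):

	/* in do_call_rcu(), before starting the tasks-trace grace period */
	c->rcu_gp_cookie = get_state_synchronize_rcu();
	call_rcu_tasks_trace(&c->rcu, __free_rcu_tasks_trace);

	/* polling-based variant of the tasks-trace callback */
	static void __free_rcu_tasks_trace(struct rcu_head *head)
	{
		struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);

		/* Free directly if a normal RCU grace period has already
		 * elapsed since the cookie was recorded, else chain one
		 * more call_rcu().
		 */
		if (poll_state_synchronize_rcu(c->rcu_gp_cookie))
			__free_rcu(head);
		else
			call_rcu(head, __free_rcu);
	}

But as said above, this does not help bpf local storage, where objects are
freed individually, so I would still prefer rcu_trace_implies_rcu_gp().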

>
> How about if I produce a patch for rcu_trace_implies_rcu_gp() and let
> you carry it with your series? That way I don't have an unused function
> in -rcu and you don't have to wait for me to send it upstream?

Sounds reasonable to me. Also thanks for your suggestions.

>
>							Thanx, Paul
>
>>>>>> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
>>>>>> ---
>>>>>>  kernel/bpf/memalloc.c | 17 ++++++-----------
>>>>>>  1 file changed, 6 insertions(+), 11 deletions(-)
>>>>>>
>>>>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>>>>> index 5f83be1d2018..6f32dddc804f 100644
>>>>>> --- a/kernel/bpf/memalloc.c
>>>>>> +++ b/kernel/bpf/memalloc.c
>>>>>> @@ -209,6 +209,9 @@ static void free_one(struct bpf_mem_cache *c, void *obj)
>>>>>>  	kfree(obj);
>>>>>>  }
>>>>>>  
>>>>>> +/* Now RCU Tasks grace period implies RCU grace period, so no need to do
>>>>>> + * an extra call_rcu().
>>>>>> + */
>>>>>>  static void __free_rcu(struct rcu_head *head)
>>>>>>  {
>>>>>>  	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
>>>>>> @@ -220,13 +223,6 @@ static void __free_rcu(struct rcu_head *head)
>>>>>>  	atomic_set(&c->call_rcu_in_progress, 0);
>>>>>>  }
>>>>>>  
>>>>>> -static void __free_rcu_tasks_trace(struct rcu_head *head)
>>>>>> -{
>>>>>> -	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
>>>>>> -
>>>>>> -	call_rcu(&c->rcu, __free_rcu);
>>>>>> -}
>>>>>> -
>>>>>>  static void enque_to_free(struct bpf_mem_cache *c, void *obj)
>>>>>>  {
>>>>>>  	struct llist_node *llnode = obj;
>>>>>> @@ -252,11 +248,10 @@ static void do_call_rcu(struct bpf_mem_cache *c)
>>>>>>  	 * from __free_rcu() and from drain_mem_cache().
>>>>>>  	 */
>>>>>>  	__llist_add(llnode, &c->waiting_for_gp);
>>>>>> -	/* Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
>>>>>> -	 * Then use call_rcu() to wait for normal progs to finish
>>>>>> -	 * and finally do free_one() on each element.
>>>>>> +	/* Use call_rcu_tasks_trace() to wait for both sleepable and normal
>>>>>> +	 * progs to finish and finally do free_one() on each element.
>>>>>>  	 */
>>>>>> -	call_rcu_tasks_trace(&c->rcu, __free_rcu_tasks_trace);
>>>>>> +	call_rcu_tasks_trace(&c->rcu, __free_rcu);
>>>>>>  }
>>>>>>  
>>>>>>  static void free_bulk(struct bpf_mem_cache *c)
>>>>>> --
>>>>>> 2.29.2
>>>>>>