Re: [PATCH] rcu: Reduce synchronize_rcu() delays when all wait heads are in use

Neeraj Upadhyay <Neeraj.Upadhyay@xxxxxxx> · Wed, 13 Mar 2024 22:56:01 +0530

On 3/13/2024 10:45 PM, Frederic Weisbecker wrote:
> Le Wed, Mar 13, 2024 at 10:24:58PM +0530, Neeraj Upadhyay a écrit :
>> Hi Frederic,
>>
>> On 3/13/2024 10:13 PM, Frederic Weisbecker wrote:
>>> Le Wed, Mar 13, 2024 at 09:41:58PM +0530, Neeraj Upadhyay a écrit :
>>>> Hi Frederic,
>>>>
>>>> On 3/13/2024 8:48 PM, Frederic Weisbecker wrote:
>>>>> Le Wed, Mar 13, 2024 at 02:02:28PM +0530, Neeraj Upadhyay a écrit :
>>>>>> When all wait heads are in use, which can happen when
>>>>>> rcu_sr_normal_gp_cleanup_work()'s callback processing
>>>>>> is slow, any new synchronize_rcu() user's rcu_synchronize
>>>>>> node's processing is deferred to future GP periods. This
>>>>>> can result in long list of synchronize_rcu() invocations
>>>>>> waiting for full grace period processing, which can delay
>>>>>> freeing of memory. Mitigate this problem by using first
>>>>>> node in the list as wait tail when all wait heads are in use.
>>>>>> While methods to speed up callback processing would be needed
>>>>>> to recover from this situation, allowing new nodes to complete
>>>>>> their grace period can help prevent delays due to a fixed
>>>>>> number of wait head nodes.
>>>>>>
>>>>>> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@xxxxxxx>
>>>>>> ---
>>>>>>  kernel/rcu/tree.c | 27 +++++++++++++--------------
>>>>>>  1 file changed, 13 insertions(+), 14 deletions(-)
>>>>>>
>>>>>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>>>>>> index 9fbb5ab57c84..bdccce1ed62f 100644
>>>>>> --- a/kernel/rcu/tree.c
>>>>>> +++ b/kernel/rcu/tree.c
>>>>>> @@ -1470,14 +1470,11 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
>>>>>>   * for this new grace period. Given that there are a fixed
>>>>>>   * number of wait nodes, if all wait nodes are in use
>>>>>>   * (which can happen when kworker callback processing
>>>>>> - * is delayed) and additional grace period is requested.
>>>>>> - * This means, a system is slow in processing callbacks.
>>>>>> - *
>>>>>> - * TODO: If a slow processing is detected, a first node
>>>>>> - * in the llist should be used as a wait-tail for this
>>>>>> - * grace period, therefore users which should wait due
>>>>>> - * to a slow process are handled by _this_ grace period
>>>>>> - * and not next.
>>>>>> + * is delayed), first node in the llist is used as wait
>>>>>> + * tail for this grace period. This means, the first node
>>>>>> + * has to go through additional grace periods before it is
>>>>>> + * part of the wait callbacks. This should be ok, as
>>>>>> + * the system is slow in processing callbacks anyway.
>>>>>>   *
>>>>>>   * Below is an illustration of how the done and wait
>>>>>>   * tail pointers move from one set of rcu_synchronize nodes
>>>>>> @@ -1725,15 +1722,17 @@ static bool rcu_sr_normal_gp_init(void)
>>>>>>  		return start_new_poll;
>>>>>>  
>>>>>>  	wait_head = rcu_sr_get_wait_head();
>>>>>> -	if (!wait_head) {
>>>>>> -		// Kick another GP to retry.
>>>>>> +	if (wait_head) {
>>>>>> +		/* Inject a wait-dummy-node. */
>>>>>> +		llist_add(wait_head, &rcu_state.srs_next);
>>>>>> +	} else {
>>>>>> +		// Kick another GP for first node.
>>>>>>  		start_new_poll = true;
>>>>>> -		return start_new_poll;
>>>>>> +		if (first == rcu_state.srs_done_tail)
>>>>>> +			return start_new_poll;
>>>>>> +		wait_head = first;
>>>>>
>>>>> This means you're setting a non-wait-head as srs_wait_tail, right?
>>>>> Doesn't it trigger the following warning in rcu_sr_normal_gp_cleanup():
>>>>>
>>>>> 	WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));
>>>>>
>>>>
>>>> Oh I missed it. Will fix it, thanks!
>>>>
>>>>> Also there is a risk that this non-wait-head gets later assigned as
>>>>> rcu_state.srs_done_tail. And then this pending sr may not be completed
>>>>> until the next grace period calling rcu_sr_normal_gp_cleanup()? (Because
>>>>> the work doesn't take care of rcu_state.srs_done_tail itself). And then
>>>>> the delay can be arbitrary.
>>>>>
>>>>
>>>> That is correct. Only the first node suffers from deferred GP.
>>>> If there are large number of callbacks which got added after
>>>> last available wait head was queued, all those callbacks (except one)
>>>> can still have a GP assigned to them.
>>>>
>>>>> And the next grace period completing this sr (that non-wait-head set
>>>>> as rcu_state.srs_done_tail) and thus allowing its caller to wipe it out
>>>>> of its stack may race with the cleanup work dereferencing it?
>>>>>
>>>>
>>>> This sr can only be completed when done tail moves to new node. Till
>>>> then, it gets deferred continuously. So, we won't be entering into
>>>> the situation where the sr processing is complete while done tail is pointing
>>>> to it. Please correct me if I am missing something here.
>>>
>>> Ok I'm confused as usual. Let's take a practical case. Is the following
>>> sequence possible?
>>>
>>> 1) wait_tail = NULL
>>>    done_tail = WH4->SR4->WH3->SR3->WH2->SR2->WH1->SR1...
>>>
>>> Initial layout
>>>
>>> 2) wait_tail = SR5 -> WH4...
>>>    done_tail = WH4->SR4->WH3->SR3->WH2->SR2->WH1->SR1...
>>>
>>> New GP
>>>
>>> 3) wait_tail = NULL
>>>    done_tail = SR5->WH4->SR4->WH3->SR3->WH2->SR2->WH1->SR1...
>>>
>>> GP completes, normal cleanup
>>>
>>> 3) wait_tail = SR6->SR5...
>>>    done_tail = SR5->WH4->SR4->WH3->SR2->WH2->SR1->WH1->SR1...
>>>
>>> New GP
>>>
>>> 4) GP completes and SR5 is completed by rcu_sr_normal_gp_cleanup(). So
>>>    the caller releases it from the stack. But before rcu_sr_normal_gp_cleanup()
>>>    overwrites done_tail to SR6->WH4->SR4.... , the workqueue manages to run
>>>    and concurrently dereferences SR5.
>>>
>>> But I bet I'm missing something obvious in the middle, preventing that...
>>
>> Your analysis looks correct to me. Maybe, one way to fix this can be that
>> rcu_sr_normal_gp_cleanup() stops processing nodes in its context,
>> when it reaches done tail and done tail is not a wait head. I will
>> think more on this, thanks!
> 
> That alone is probably not enough. In the end you may end up with a real
> pending sr stuck as done tail without completion, until one day a
> new real queue comes up, preferably with a real wait head in order not
> to get stuck with a new sr as done tail.
> 

But after point 4 in your sequence, rcu_sr_normal_gp_cleanup() would move
the done tail to SR6 and queue a new work, which will process SR5,
so, we will be able to progress real pending srs?

Thanks
Neeraj

>>
>>
>> Thanks
>> Neeraj