On 3/13/2024 10:56 PM, Neeraj Upadhyay wrote:
>
>
> On 3/13/2024 10:45 PM, Frederic Weisbecker wrote:
>> On Wed, Mar 13, 2024 at 10:24:58PM +0530, Neeraj Upadhyay wrote:
>>> Hi Frederic,
>>>
>>> On 3/13/2024 10:13 PM, Frederic Weisbecker wrote:
>>>> On Wed, Mar 13, 2024 at 09:41:58PM +0530, Neeraj Upadhyay wrote:
>>>>> Hi Frederic,
>>>>>
>>>>> On 3/13/2024 8:48 PM, Frederic Weisbecker wrote:
>>>>>> On Wed, Mar 13, 2024 at 02:02:28PM +0530, Neeraj Upadhyay wrote:
>>>>>>> When all wait heads are in use, which can happen when
>>>>>>> rcu_sr_normal_gp_cleanup_work()'s callback processing
>>>>>>> is slow, any new synchronize_rcu() user's rcu_synchronize
>>>>>>> node's processing is deferred to future GP periods. This
>>>>>>> can result in a long list of synchronize_rcu() invocations
>>>>>>> waiting for full grace period processing, which can delay
>>>>>>> freeing of memory. Mitigate this problem by using the first
>>>>>>> node in the list as the wait tail when all wait heads are
>>>>>>> in use. While methods to speed up callback processing would
>>>>>>> be needed to recover from this situation, allowing new nodes
>>>>>>> to complete their grace period can help prevent delays due
>>>>>>> to a fixed number of wait head nodes.
>>>>>>>
>>>>>>> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@xxxxxxx>
>>>>>>> ---
>>>>>>>  kernel/rcu/tree.c | 27 +++++++++++++--------------
>>>>>>>  1 file changed, 13 insertions(+), 14 deletions(-)
>>>>>>>
>>>>>>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>>>>>>> index 9fbb5ab57c84..bdccce1ed62f 100644
>>>>>>> --- a/kernel/rcu/tree.c
>>>>>>> +++ b/kernel/rcu/tree.c
>>>>>>> @@ -1470,14 +1470,11 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
>>>>>>>   * for this new grace period. Given that there are a fixed
>>>>>>>   * number of wait nodes, if all wait nodes are in use
>>>>>>>   * (which can happen when kworker callback processing
>>>>>>> - * is delayed) and additional grace period is requested.
>>>>>>> - * This means, a system is slow in processing callbacks.
>>>>>>> - *
>>>>>>> - * TODO: If a slow processing is detected, a first node
>>>>>>> - * in the llist should be used as a wait-tail for this
>>>>>>> - * grace period, therefore users which should wait due
>>>>>>> - * to a slow process are handled by _this_ grace period
>>>>>>> - * and not next.
>>>>>>> + * is delayed), first node in the llist is used as wait
>>>>>>> + * tail for this grace period. This means, the first node
>>>>>>> + * has to go through additional grace periods before it is
>>>>>>> + * part of the wait callbacks. This should be ok, as
>>>>>>> + * the system is slow in processing callbacks anyway.
>>>>>>>   *
>>>>>>>   * Below is an illustration of how the done and wait
>>>>>>>   * tail pointers move from one set of rcu_synchronize nodes
>>>>>>> @@ -1725,15 +1722,17 @@ static bool rcu_sr_normal_gp_init(void)
>>>>>>>  		return start_new_poll;
>>>>>>>
>>>>>>>  	wait_head = rcu_sr_get_wait_head();
>>>>>>> -	if (!wait_head) {
>>>>>>> -		// Kick another GP to retry.
>>>>>>> +	if (wait_head) {
>>>>>>> +		/* Inject a wait-dummy-node. */
>>>>>>> +		llist_add(wait_head, &rcu_state.srs_next);
>>>>>>> +	} else {
>>>>>>> +		// Kick another GP for first node.
>>>>>>>  		start_new_poll = true;
>>>>>>> -		return start_new_poll;
>>>>>>> +		if (first == rcu_state.srs_done_tail)
>>>>>>> +			return start_new_poll;
>>>>>>> +		wait_head = first;
>>>>>>
>>>>>> This means you're setting a non-wait-head as srs_wait_tail, right?
>>>>>> Doesn't it trigger the following warning in rcu_sr_normal_gp_cleanup():
>>>>>>
>>>>>>     WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));
>>>>>>
>>>>>
>>>>> Oh, I missed it. Will fix it, thanks!
>>>>>
>>>>>> Also there is a risk that this non-wait-head gets later assigned as
>>>>>> rcu_state.srs_done_tail. And then this pending sr may not be completed
>>>>>> until the next grace period calling rcu_sr_normal_gp_cleanup()? (Because
>>>>>> the work doesn't take care of rcu_state.srs_done_tail itself.) And then
>>>>>> the delay can be arbitrary.
>>>>>>
>>>>>
>>>>> That is correct. Only the first node suffers from a deferred GP.
>>>>> If there are a large number of callbacks which got added after the
>>>>> last available wait head was queued, all those callbacks (except one)
>>>>> can still have a GP assigned to them.
>>>>>
>>>>>> And the next grace period completing this sr (that non-wait-head set
>>>>>> as rcu_state.srs_done_tail) and thus allowing its caller to wipe it out
>>>>>> of its stack may race with the cleanup work dereferencing it?
>>>>>>
>>>>>
>>>>> This sr can only be completed when the done tail moves to a new node.
>>>>> Till then, it gets deferred continuously. So, we won't enter the
>>>>> situation where the sr processing is complete while the done tail is
>>>>> pointing to it. Please correct me if I am missing something here.
>>>>
>>>> Ok, I'm confused as usual. Let's take a practical case. Is the following
>>>> sequence possible?
>>>>
>>>> 1) wait_tail = NULL
>>>>    done_tail = WH4->SR4->WH3->SR3->WH2->SR2->WH1->SR1...
>>>>
>>>>    Initial layout
>>>>
>>>> 2) wait_tail = SR5->WH4...
>>>>    done_tail = WH4->SR4->WH3->SR3->WH2->SR2->WH1->SR1...
>>>>
>>>>    New GP
>>>>
>>>> 3) wait_tail = NULL
>>>>    done_tail = SR5->WH4->SR4->WH3->SR3->WH2->SR2->WH1->SR1...
>>>>
>>>>    GP completes, normal cleanup
>>>>
>>>> 4) wait_tail = SR6->SR5...
>>>>    done_tail = SR5->WH4->SR4->WH3->SR3->WH2->SR2->WH1->SR1...
>>>>
>>>>    New GP
>>>>
>>>> 5) GP completes and SR5 is completed by rcu_sr_normal_gp_cleanup(). So
>>>>    the caller releases it from the stack. But before rcu_sr_normal_gp_cleanup()
>>>>    overwrites done_tail to SR6->WH4->SR4..., the workqueue manages to run
>>>>    and concurrently dereferences SR5.
>>>>
>>>> But I bet I'm missing something obvious in the middle, preventing that...
>>>
>>> Your analysis looks correct to me. Maybe one way to fix this can be that
>>> rcu_sr_normal_gp_cleanup() stops processing nodes in its context
>>> when it reaches the done tail and the done tail is not a wait head. I will
>>> think more on this, thanks!
>>
>> That alone is probably not enough. In the end you may end up with a real
>> pending sr stuck as done tail without completion, until one day a
>> new real queue comes up, preferably with a real wait head in order not
>> to get stuck with a new sr as done tail.
>>
>
> But after point 4 in your sequence, rcu_sr_normal_gp_cleanup() would move
> the done tail to SR6 and queue new work, which will process SR5.
> So, we will be able to make progress on real pending srs?
>

Missed one point. We will continue initiating new GPs until the first
node's cb is processed, which can happen when a wait head becomes
available.

Thanks
Neeraj

> Thanks
> Neeraj
>
>>>
>>>
>>> Thanks
>>> Neeraj
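For readers following the thread, below is a minimal userspace sketch of
the fallback being debated: a small fixed pool of wait heads, with the
first llist node pressed into service as the wait tail once the pool is
exhausted. This is an illustration only, not the kernel code:
WAIT_HEAD_MAX, get_wait_head(), list_push() and sr_normal_gp_init() are
simplified stand-ins for the kernel's fixed wait-node pool,
rcu_sr_get_wait_head(), llist_add() and rcu_sr_normal_gp_init(), and it
deliberately omits the srs_done_tail bookkeeping, so the cleanup-work
race analyzed above does not appear in it.

    /*
     * Userspace model of the wait-head fallback under discussion.
     * Not the kernel implementation; see the caveats above.
     */
    #include <stdbool.h>
    #include <stdio.h>

    #define WAIT_HEAD_MAX 5         /* small fixed pool, as in the kernel */

    struct node {
            struct node *next;
            bool is_wait_head;      /* true: WH marker, false: an sr waiter */
            int id;
    };

    static struct node wait_heads[WAIT_HEAD_MAX];
    static bool wait_head_busy[WAIT_HEAD_MAX];

    static struct node *list_head;  /* models rcu_state.srs_next */
    static struct node *wait_tail;  /* models rcu_state.srs_wait_tail */

    /* Models rcu_sr_get_wait_head(): grab a free wait head, if any. */
    static struct node *get_wait_head(void)
    {
            for (int i = 0; i < WAIT_HEAD_MAX; i++) {
                    if (!wait_head_busy[i]) {
                            wait_head_busy[i] = true;
                            wait_heads[i].is_wait_head = true;
                            wait_heads[i].id = i;
                            return &wait_heads[i];
                    }
            }
            return NULL;            /* pool exhausted: cleanup work is lagging */
    }

    /* Models llist_add(): push a node at the head of the list. */
    static void list_push(struct node *n)
    {
            n->next = list_head;
            list_head = n;
    }

    /*
     * Models the patched rcu_sr_normal_gp_init(): normally inject a
     * fresh wait head; when none is free, fall back to using the first
     * queued node as the wait tail so newer waiters are still covered
     * by this grace period, and kick another GP for that first node.
     */
    static bool sr_normal_gp_init(void)
    {
            struct node *first = list_head;
            bool start_new_poll = false;
            struct node *wh;

            if (!first || first->is_wait_head)
                    return start_new_poll;  /* no new waiters since last GP */

            wh = get_wait_head();
            if (wh) {
                    list_push(wh);          /* inject a wait-dummy-node */
            } else {
                    start_new_poll = true;  /* kick another GP for first node */
                    wh = first;             /* the fallback under discussion */
            }
            wait_tail = wh;
            return start_new_poll;
    }

    int main(void)
    {
            struct node srs[8] = { 0 };

            /* Queue more waiters than wait heads, never releasing a head. */
            for (int i = 0; i < 8; i++) {
                    srs[i].id = 100 + i;
                    list_push(&srs[i]);
                    bool kick = sr_normal_gp_init();
                    printf("GP %d: wait_tail = %s%d%s\n", i,
                           wait_tail->is_wait_head ? "WH" : "SR",
                           wait_tail->id,
                           kick ? "  (kick another GP)" : "");
            }
            return 0;
    }

The first five iterations print WH0..WH4 as the wait tail; from the
sixth onward the pool is empty, so the sketch prints SR10x and sets the
kick flag, mirroring how the patch keeps newer waiters attached to the
current grace period while only the first node keeps being deferred.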