Re: [PATCH v5 2/4] rcu: Reduce synchronize_rcu() latency

Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> · Wed, 28 Feb 2024 11:44:19 -0500

On 2/28/2024 9:32 AM, Joel Fernandes wrote:
> 
> 
> On 2/20/2024 1:31 PM, Uladzislau Rezki (Sony) wrote:
[...]
>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>> index c8980d76f402..1328da63c3cd 100644
>> --- a/kernel/rcu/tree.c
>> +++ b/kernel/rcu/tree.c
>> @@ -75,6 +75,7 @@
>>  #define MODULE_PARAM_PREFIX "rcutree."
>>  
>>  /* Data structures. */
>> +static void rcu_sr_normal_gp_cleanup_work(struct work_struct *);
>>  
>>  static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = {
>>  	.gpwrap = true,
>> @@ -93,6 +94,8 @@ static struct rcu_state rcu_state = {
>>  	.exp_mutex = __MUTEX_INITIALIZER(rcu_state.exp_mutex),
>>  	.exp_wake_mutex = __MUTEX_INITIALIZER(rcu_state.exp_wake_mutex),
>>  	.ofl_lock = __ARCH_SPIN_LOCK_UNLOCKED,
>> +	.srs_cleanup_work = __WORK_INITIALIZER(rcu_state.srs_cleanup_work,
>> +		rcu_sr_normal_gp_cleanup_work),
>>  };
>>  
>>  /* Dump rcu_node combining tree at boot to verify correct setup. */
>> @@ -1422,6 +1425,282 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
>>  		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>>  }
> [..]
>> +static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
>> +{
>> +	llist_add((struct llist_node *) &rs->head, &rcu_state.srs_next);
>> +}
>> +
> 
> I'm a bit concerned from a memory order PoV about this llist_add() happening
> possibly on a different CPU than the GP thread, and different than the kworker
> thread. Basically we can have 3 CPUs simultaneously modifying and reading the
> list, but only 2 CPUs have the acq-rel pair AFAICS.
> 
> Consider the following situation:
> 
> synchronize_rcu() user
> ----------------------
> llist_add the user U - update srs_next list
> 
> rcu_gp_init() and rcu_gp_cleanup() (SAME THREAD)
> --------------------
> insert dummy node in front of U, call it S
> update wait_tail to U
> 
> and then cleanup:
> read wait_tail to W
> set wait_tail to NULL
> set done_tail to W (RELEASE) -- this release ensures U and S are seen by worker.
> 
> workqueue handler
> -----------------
> read done_tail (ACQUIRE)
> disconnect rest of list -- disconnected list guaranteed to have U and S,
>                            if done_tail read was W.
> ---------------------------------
> 
> So llist_add() does this (assume new_first and new_last are same):
> 
> 	struct llist_node *first = READ_ONCE(head->first);
> 
> 	do {
> 		new_last->next = first;
> 	} while (!try_cmpxchg(&head->first, &first, new_first));
> 
> 	return !first;
> ---
> 
> It reads head->first, then points new_last->next (call it new_first->next)
> at the old first, then sets head->first to new_first if head->first did not
> change in the meantime.
> 
> The problem I guess happens if the update to head->first is seen *before* the
> update to new_first->next.
> 
> This potentially means a corrupted list is seen in the workqueue handler,
> because the "U" node is not yet seen pointing to the rest of the list
> (previously added nodes), but is already seen as the head of the list.
> 
> I am not sure if this can happen, but AFAIK try_cmpxchg() doesn't imply ordering
> per se. Maybe that try_cmpxchg() should be a try_cmpxchg_release() in llist_add()?

Everyone in the internal RCU crew corrected me offline: try_cmpxchg() is fully
ordered when the cmpxchg succeeds.

That orders the ->next store before the head->first update, so I don't think
the issue I mentioned can occur, and we can park this.
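
For the archives, here is a rough userspace C11 analogue of the llist_add()
pattern discussed above. It is only a sketch, not the kernel code: struct node,
struct list_head, push() and del_all() are hypothetical stand-ins, and
memory_order_seq_cst on a successful compare-exchange is used as an
approximation of the kernel's fully ordered try_cmpxchg(). The point it tries
to show is that the plain store to n->next is ordered before the node becomes
reachable through the list head.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for struct llist_node / llist_head. */
struct node {
	struct node *next;
};

struct list_head {
	_Atomic(struct node *) first;
};

/* Push one node; returns true if the list was empty before the push. */
static bool push(struct node *n, struct list_head *h)
{
	struct node *first = atomic_load_explicit(&h->first,
						  memory_order_relaxed);

	do {
		/* Plain store: must be visible before n is published. */
		n->next = first;
		/*
		 * Assumption: seq_cst on success approximates a fully ordered
		 * cmpxchg. On failure 'first' is reloaded and the loop retries
		 * with no ordering guarantee, which is fine because nothing
		 * was published.
		 */
	} while (!atomic_compare_exchange_weak_explicit(&h->first, &first, n,
							memory_order_seq_cst,
							memory_order_relaxed));

	return first == NULL;
}

/* Detach the whole list in one shot and return its old head. */
static struct node *del_all(struct list_head *h)
{
	return atomic_exchange_explicit(&h->first, (struct node *)NULL,
					memory_order_acquire);
}
```

A consumer that obtains a node from del_all() and then follows ->next relies on
exactly this ordering. With a relaxed compare-exchange in push(), the consumer
could in principle observe the new head before its ->next store, which is the
scenario I was worried about.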

Thanks!

 - Joel