Re: [PATCH v5 2/4] rcu: Reduce synchronize_rcu() latency

On 2/20/2024 1:31 PM, Uladzislau Rezki (Sony) wrote:
> A call to synchronize_rcu() can be optimized from a latency
> point of view. Workloads which depend on this can benefit from it.
> 
> The delay of wakeme_after_rcu() callback, which unblocks a waiter,
> depends on several factors:
> 
> - how fast the offloading process is started. Combination of:
>     - !CONFIG_RCU_NOCB_CPU/CONFIG_RCU_NOCB_CPU;
>     - !CONFIG_RCU_LAZY/CONFIG_RCU_LAZY;
>     - other.
> - once started, whether the invoking path is interrupted due to:
>     - a time limit;
>     - need_resched();
>     - the callback batch limit being reached.
> - where in the nocb list the callback is located;
> - how fast the previous callbacks completed.
> 
> Example:
> 
> 1. On our embedded devices I can easily trigger a scenario where
> it is the last in a list of ~3600 callbacks:
> 
> <snip>
>   <...>-29      [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
> ...
>   <...>-29      [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
>   <...>-29      [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
> <snip>
> 
> 2. We use cpuset/cgroup to classify tasks and assign them into
> different cgroups. For example, the "background" group binds tasks
> only to little CPUs, whereas "foreground" makes use of all CPUs.
> Tasks can be migrated between groups on request if acceleration
> is needed.
> 
> See below an example of how the "surfaceflinger" task gets migrated.
> Initially it is located in the "system-background" cgroup, which
> allows it to run only on little cores. In order to speed it up, it
> can be temporarily moved into the "foreground" cgroup, which allows
> it to use big/all CPUs:
> 
> cgroup_attach_task():
>  -> cgroup_migrate_execute()
>    -> cpuset_can_attach()
>      -> percpu_down_write()
>        -> rcu_sync_enter()
>          -> synchronize_rcu()
>    -> now move tasks to the new cgroup.
>  -> cgroup_migrate_finish()
> 
> <snip>
>          rcuop/1-29      [000] .....  7030.528570: rcu_invoke_callback: rcu_preempt rhp=00000000461605e0 func=wakeme_after_rcu.cfi_jt
>     PERFD-SERVER-1855    [000] d..1.  7030.530293: cgroup_attach_task: dst_root=3 dst_id=22 dst_level=1 dst_path=/foreground pid=1900 comm=surfaceflinger
>    TimerDispatch-2768    [002] d..5.  7030.537542: sched_migrate_task: comm=surfaceflinger pid=1900 prio=98 orig_cpu=0 dest_cpu=4
> <snip>
> 
> "Boosting a task" depends on synchronize_rcu() latency:
> 
> - first trace shows a completion of synchronize_rcu();
> - second shows attaching a task to a new group;
> - last shows a final step when migration occurs.
> 
> 3. To address this drawback, maintain a separate track that consists
> of synchronize_rcu() callers only. After a grace period completes,
> the users are handed off to a dedicated worker that processes the
> requests.
> 
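For context, the flow described in point 3, as I read it, is roughly the
following. This is a simplified sketch rather than the actual patch code;
apart from rcu_state.srs_next (see the hunk further down), the names are
illustrative:

<snip>
/* Illustrative sketch only -- not the patch code. */

/* Caller side: queue ourselves on a dedicated llist and block. */
static void sr_normal_wait_sketch(void)
{
	struct rcu_synchronize rs;

	init_rcu_head_on_stack(&rs.head);
	init_completion(&rs.completion);

	/* Same enqueue as rcu_sr_normal_add_req() quoted below. */
	llist_add((struct llist_node *) &rs.head, &rcu_state.srs_next);

	/* Starting/waiting for the grace period itself is omitted here. */
	wait_for_completion(&rs.completion);
	destroy_rcu_head_on_stack(&rs.head);
}

/* Worker side: after the grace period, complete every queued waiter. */
static void sr_normal_complete_sketch(struct llist_node *done)
{
	struct llist_node *node, *next;

	llist_for_each_safe(node, next, done) {
		struct rcu_synchronize *rs =
			container_of((struct rcu_head *) node,
				     struct rcu_synchronize, head);

		complete(&rs->completion);
	}
}
<snip>
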
> 4. This patch reduces the latency of synchronize_rcu() by
> approximately 30-40% on synthetic tests. The real test case, camera
> launch time, shows (time is in milliseconds):
> 
> 1-run 542 vs 489 improvement 9%
> 2-run 540 vs 466 improvement 13%
> 3-run 518 vs 468 improvement 9%
> 4-run 531 vs 457 improvement 13%
> 5-run 548 vs 475 improvement 13%
> 6-run 509 vs 484 improvement 4%
> 
> Synthetic test (no "noise" from other callbacks):
> Hardware: x86_64, 64 CPUs, 64GB of memory
> Linux-6.6
> 
> - 10K tasks (simultaneous);
> - each task does (1000 loops):
>      synchronize_rcu();
>      kfree(p);
> 
> default: CONFIG_RCU_NOCB_CPU: takes 54 seconds to complete all users;
> patch: CONFIG_RCU_NOCB_CPU: takes 35 seconds to complete all users.
> 
> Running 60K tasks gives approximately the same results on my setup.
> Please note this is without any interaction with other types of
> callbacks; otherwise it would heavily impact the default case.
> 
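(For reference, my reading of the synthetic test above is a loop like the one
below. This is only a sketch under those assumptions -- the test code itself
is not part of this patch, and I am assuming p is a small kmalloc()'d object:)

<snip>
/* Sketch of the synthetic test as described; not the actual test code. */
static int sr_test_thread(void *arg)
{
	int i;

	for (i = 0; i < 1000; i++) {
		void *p = kmalloc(64, GFP_KERNEL);

		synchronize_rcu();
		kfree(p);
	}

	return 0;
}

static void sr_test_run(int nr_tasks)
{
	int i;

	/* 10K (or 60K) concurrent tasks, each running the loop above. */
	for (i = 0; i < nr_tasks; i++)
		kthread_run(sr_test_thread, NULL, "sr-test/%d", i);

	/* Measuring the time until all tasks complete is omitted here. */
}
<snip>
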
> 5. By default it is disabled. To enable it, use one of the
> following:
> 
> echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1"
> 
> Reviewed-by: Frederic Weisbecker <frederic@xxxxxxxxxx>
> Co-developed-by: Neeraj Upadhyay <Neeraj.Upadhyay@xxxxxxx>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@xxxxxxxxx>
> ---
>  .../admin-guide/kernel-parameters.txt         |  14 +
>  kernel/rcu/tree.c                             | 331 +++++++++++++++++-
>  kernel/rcu/tree_exp.h                         |   2 +-
>  3 files changed, 345 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 2244aa0a013b..7ca84cf7b4f4 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5059,6 +5059,20 @@
>  			delay, memory pressure or callback list growing too
>  			big.
>  
> +	rcutree.rcu_normal_wake_from_gp= [KNL]
> +			Reduces a latency of synchronize_rcu() call. This approach
> +			maintains its own track of synchronize_rcu() callers, so it
> +			does not interact with regular callbacks because it does not
> +			use a call_rcu[_hurry]() path. Please note, this is for a
> +			normal grace period.
> +
> +			How to enable it:
> +
> +			echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> +			or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1"
> +
> +			Default is 0.
> +
>  	rcuscale.gp_async= [KNL]
>  			Measure performance of asynchronous
>  			grace-period primitives such as call_rcu().
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index c8980d76f402..1328da63c3cd 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -75,6 +75,7 @@
>  #define MODULE_PARAM_PREFIX "rcutree."
>  
>  /* Data structures. */
> +static void rcu_sr_normal_gp_cleanup_work(struct work_struct *);
>  
>  static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = {
>  	.gpwrap = true,
> @@ -93,6 +94,8 @@ static struct rcu_state rcu_state = {
>  	.exp_mutex = __MUTEX_INITIALIZER(rcu_state.exp_mutex),
>  	.exp_wake_mutex = __MUTEX_INITIALIZER(rcu_state.exp_wake_mutex),
>  	.ofl_lock = __ARCH_SPIN_LOCK_UNLOCKED,
> +	.srs_cleanup_work = __WORK_INITIALIZER(rcu_state.srs_cleanup_work,
> +		rcu_sr_normal_gp_cleanup_work),
>  };
>  
>  /* Dump rcu_node combining tree at boot to verify correct setup. */
> @@ -1422,6 +1425,282 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
>  		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>  }
[..]
> +static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> +{
> +	llist_add((struct llist_node *) &rs->head, &rcu_state.srs_next);
> +}
> +

I'm a bit concerned from a memory-ordering PoV about this llist_add() possibly
happening on a different CPU than the GP thread, and a different one than the
kworker thread. Basically we can have 3 CPUs simultaneously modifying and
reading the list, but only 2 of them have the acq-rel pair AFAICS.

Consider the following situation:

synchronize_rcu() user
----------------------
llist_add the user U - update srs_next list

rcu_gp_init() and rcu_gp_cleanup() (SAME THREAD)
--------------------
insert dummy node in front of U, call it S
update wait_tail to U

and then cleanup:
read wait_tail to W
set wait_tail to NULL
set done_tail to W (RELEASE) -- this release ensures U and S are seen by worker.

workqueue handler
-----------------
read done_tail (ACQUIRE)
disconnect rest of list -- disconnected list guaranteed to have U and S,
                           if done_tail read was W.
---------------------------------
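In code terms, the pairing I'm describing is roughly the below; wait_tail and
done_tail stand for the corresponding rcu_state fields, and the function names
are made up -- a simplified model, not the patch code:

<snip>
static struct llist_node *wait_tail;	/* points at U after rcu_gp_init() */
static struct llist_node *done_tail;	/* what the worker consumes        */

/* rcu_gp_cleanup() side: */
static void gp_cleanup_sketch(void)
{
	struct llist_node *w = wait_tail;	/* read wait_tail to W   */

	wait_tail = NULL;			/* set wait_tail to NULL */
	/* RELEASE: U and S, linked behind W, must be visible before this. */
	smp_store_release(&done_tail, w);
}

/* workqueue handler side: */
static struct llist_node *worker_sketch(void)
{
	/*
	 * ACQUIRE: pairs with the release above, so if the return value
	 * is W, the disconnected sublist is guaranteed to contain U and S.
	 */
	return smp_load_acquire(&done_tail);
}
<snip>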

So llist_add() does this (assume new_first and new_last are the same):

	struct llist_node *first = READ_ONCE(head->first);

	do {
		new_last->next = first;
	} while (!try_cmpxchg(&head->first, &first, new_first));

	return !first;
---

It reads head->first, then writes new_last->next (call it new_first->next)
to the old first, then sets head->first to new_first if head->first did not
change in the meantime.

The problem, I guess, happens if the update to head->first is seen *after* the
update to new_first->next.

This potentially means a corrupted list is seen in the workqueue handler,
because the "U" node is not yet seen pointing to the rest of the list
(previously added nodes), but is already seen as the head of the list.

I am not sure if this can happen, but AFAIK try_cmpxchg() doesn't imply
ordering per se. Maybe that try_cmpxchg() should be a try_cmpxchg_release()
in llist_add()?
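
For illustration only, that suggestion would look roughly like the below
against lib/llist.c (an untested sketch, and whether all llist users want the
extra ordering is a separate question):

<snip>
bool llist_add_batch(struct llist_node *new_first, struct llist_node *new_last,
		     struct llist_head *head)
{
	struct llist_node *first = READ_ONCE(head->first);

	do {
		new_last->next = first;
		/*
		 * RELEASE on the publish: order the ->next store before
		 * the new nodes become reachable via head->first.
		 */
	} while (!try_cmpxchg_release(&head->first, &first, new_first));

	return !first;
}
<snip>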

thanks,

 - Joel



