On Tue, Feb 27, 2024 at 03:51:03PM -0500, Joel Fernandes wrote:
>
> On 2/20/2024 1:31 PM, Uladzislau Rezki (Sony) wrote:
> > A call to synchronize_rcu() can be optimized from a latency point
> > of view. Workloads which depend on this can benefit from it.
> >
> > The delay of the wakeme_after_rcu() callback, which unblocks a
> > waiter, depends on several factors:
> >
> > - how fast the offloading process is started; a combination of:
> >     - !CONFIG_RCU_NOCB_CPU/CONFIG_RCU_NOCB_CPU;
> >     - !CONFIG_RCU_LAZY/CONFIG_RCU_LAZY;
> >     - other;
> > - once started, whether the invoking path is interrupted due to:
> >     - a time limit;
> >     - need_resched();
> >     - the batch limit being reached;
> > - where in the nocb list the callback is located;
> > - how fast the previous callbacks completed.
> >
> > Example:
> >
> > 1. On our embedded devices I can easily trigger a scenario where
> > it is the last in a list of ~3600 callbacks:
> >
> > <snip>
> > <...>-29 [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
> > ...
> > <...>-29 [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
> > <...>-29 [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
> > <...>-29 [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
> > <...>-29 [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
> > <...>-29 [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
> > <...>-29 [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
> > <...>-29 [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
> > <snip>
> >
> > 2. We use cpuset/cgroup to classify tasks and assign them into
> > different cgroups, for example a "background" group which binds
> > tasks only to little CPUs, or a "foreground" group which makes use
> > of all CPUs. Tasks can be migrated between groups on request if
> > acceleration is needed.
> >
> > See below an example of how the "surfaceflinger" task gets
> > migrated. Initially it is located in the "system-background"
> > cgroup, which allows it to run only on little cores. In order to
> > speed it up, it can be temporarily moved into the "foreground"
> > cgroup, which allows it to use big/all CPUs:
> >
> > cgroup_attach_task():
> > -> cgroup_migrate_execute()
> > -> cpuset_can_attach()
> > -> percpu_down_write()
> > -> rcu_sync_enter()
> > -> synchronize_rcu()
>
> We should do this patch, but I also wonder whether
> cgroup_attach_task()'s usage of synchronize_rcu() should actually be
> using the _expedited() variant (via some possible flag to the percpu
> rwsem / rcu_sync).
>
> If the user assumes it is a slow path, then usage of _expedited()
> should probably be OK. If it is assumed to be a fast path, then it is
> probably hurting latency anyway unless this patch's
> rcu_normal_wake_from_gp is enabled.
>
> Thoughts?
>
As I see it, rcu_normal_wake_from_gp is disabled so far. We need to
work on this further to have it on by default, but we will move toward
that.

> Then it becomes a matter of how to plumb the expeditedness down the
> stack.
>
> Also, speaking of the percpu rwsem, I noticed that percpu refcounts
> don't use rcu_sync. I haven't looked closely at why, but something I
> hope to get time to look into is whether it can be converted over,
> and what benefits, if any, that would entail.
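For reference, the percpu rwsem is exactly where rcu_sync pays off
here: while the rcu_sync instance is idle, readers only bump a per-CPU
counter, and percpu_down_write() flips them onto the slow path via
rcu_sync_enter(), which is the synchronize_rcu() in the call chain
above. A condensed sketch of the reader side (from
include/linux/percpu-rwsem.h, with the lockdep and might_sleep() bits
dropped):

<snip>
static inline void percpu_down_read(struct percpu_rw_semaphore *sem)
{
	preempt_disable();
	/* No writer around: just bump a per-CPU counter. */
	if (likely(rcu_sync_is_idle(&sem->rss)))
		this_cpu_inc(*sem->read_count);
	else
		__percpu_down_read(sem, false);	/* writer in progress */
	preempt_enable();
}
<snip>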
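And on plumbing the expeditedness down the stack, one possible shape,
purely as a sketch for discussion: neither the "expedited" field nor
the helper below exists today, the names are invented for
illustration only:

<snip>
/*
 * HYPOTHETICAL: an opt-in flag on the rcu_sync instance; the percpu
 * rwsem could set it for semaphores known to sit on user-visible
 * paths. None of this exists in the kernel today.
 */
struct rcu_sync_flagged {		/* stand-in for struct rcu_sync */
	bool expedited;
	/* ...existing rcu_sync state... */
};

static void rcu_sync_wait_gp(struct rcu_sync_flagged *rsp)
{
	if (READ_ONCE(rsp->expedited))
		synchronize_rcu_expedited();	/* fast, but disruptive */
	else
		synchronize_rcu();		/* normal grace period */
}
<snip>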
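Coming back to the numbers at the top of the thread: the stock
normal-GP wait is just a completion hung off an ordinary callback, so
wakeme_after_rcu() cannot run before everything queued ahead of it,
which is what rcu_normal_wake_from_gp avoids by waking waiters right
after the grace period ends instead of queuing them behind other
callbacks. A condensed sketch of the stock path (the real code is
wait_rcu_gp() in kernel/rcu/update.c; wait_normal_gp() below is a
made-up wrapper name, and the on-stack rcu_head debug initialization
is omitted):

<snip>
struct rcu_synchronize {
	struct rcu_head head;
	struct completion completion;
};

static void wakeme_after_rcu(struct rcu_head *head)
{
	struct rcu_synchronize *rcu =
		container_of(head, struct rcu_synchronize, head);

	complete(&rcu->completion);
}

static void wait_normal_gp(void)
{
	struct rcu_synchronize rs;

	init_completion(&rs.completion);
	call_rcu(&rs.head, wakeme_after_rcu);	/* joins the CB list */
	wait_for_completion(&rs.completion);	/* e.g. behind 3612 CBs */
}
<snip>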
>
> Also, I will continue reviewing the patch. Thanks.
>
Thanks.

--
Uladzislau Rezki