On Tue, Feb 27, 2024 at 03:51:03PM -0500, Joel Fernandes wrote:
>
> On 2/20/2024 1:31 PM, Uladzislau Rezki (Sony) wrote:
> > A call to synchronize_rcu() can be optimized from a latency point
> > of view. Workloads which depend on this can benefit from it.
> >
> > The delay of the wakeme_after_rcu() callback, which unblocks a
> > waiter, depends on several factors:
> >
> > - how fast the offloading process is started; a combination of:
> >     - !CONFIG_RCU_NOCB_CPU/CONFIG_RCU_NOCB_CPU;
> >     - !CONFIG_RCU_LAZY/CONFIG_RCU_LAZY;
> >     - other;
> > - once started, whether the invoking path is interrupted due to:
> >     - a time limit;
> >     - need_resched();
> >     - the batch limit being reached;
> > - where in the nocb list the callback is located;
> > - how fast the previous callbacks completed.
> >
> > Example:
> >
> > 1. On our embedded devices I can easily trigger a scenario where
> > it is the last in a list of ~3600 callbacks:
> >
> > <snip>
> > <...>-29 [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
> > ...
> > <...>-29 [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
> > <...>-29 [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
> > <...>-29 [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
> > <...>-29 [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
> > <...>-29 [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
> > <...>-29 [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
> > <...>-29 [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
> > <snip>
> >
> > 2. We use cpuset/cgroup to classify tasks and assign them into
> > different cgroups, for example a "background" group which binds
> > tasks only to little CPUs, or a "foreground" group which makes use
> > of all CPUs. Tasks can be migrated between groups on request if
> > acceleration is needed.
> >
> > See below an example of how the "surfaceflinger" task gets
> > migrated. Initially it is located in the "system-background"
> > cgroup, which allows it to run only on little cores. In order to
> > speed it up, it can be temporarily moved into the "foreground"
> > cgroup, which allows it to use big/all CPUs:
> >
> > cgroup_attach_task():
> > -> cgroup_migrate_execute()
> > -> cpuset_can_attach()
> > -> percpu_down_write()
> > -> rcu_sync_enter()
> > -> synchronize_rcu()
>
> We should do this patch, but I also wonder whether
> cgroup_attach_task()'s usage of synchronize_rcu() should actually be
> using the _expedited() variant (via some possible flag to the percpu
> rwsem / rcu_sync).
>
> If the user assumes it is a slow path, then usage of _expedited()
> should probably be OK. If it is assumed to be a fast path, then it is
> probably hurting latency anyway unless this patch's
> rcu_normal_wake_from_gp is enabled.
>
> Thoughts?
>
As I see it, rcu_normal_wake_from_gp is disabled so far. We need to
work on this further to have it on by default, but we will move toward
that.

> Then it becomes a matter of how to plumb the expeditedness down the
> stack.
>
> Also, speaking of the percpu rwsem, I noticed that percpu refcounts
> don't use rcu_sync. I haven't looked closely at why, but something I
> hope to get time to look into is whether it can be converted over,
> and what benefits, if any, that would entail.
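For reference, the percpu rwsem is exactly where rcu_sync pays off
here: while the rcu_sync instance is idle, readers only bump a per-CPU
counter, and percpu_down_write() flips them onto the slow path via
rcu_sync_enter(), which is the synchronize_rcu() in the call chain
above. A condensed sketch of the reader side (from
include/linux/percpu-rwsem.h, with the lockdep and might_sleep() bits
dropped):

<snip>
static inline void percpu_down_read(struct percpu_rw_semaphore *sem)
{
	preempt_disable();
	/* No writer around: just bump a per-CPU counter. */
	if (likely(rcu_sync_is_idle(&sem->rss)))
		this_cpu_inc(*sem->read_count);
	else
		__percpu_down_read(sem, false);	/* writer in progress */
	preempt_enable();
}
<snip>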
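And on plumbing the expeditedness down the stack, one possible shape,
purely as a sketch for discussion: neither the "expedited" field nor
the helper below exists today, the names are invented for
illustration only:

<snip>
/*
 * HYPOTHETICAL: an opt-in flag on the rcu_sync instance; the percpu
 * rwsem could set it for semaphores known to sit on user-visible
 * paths. None of this exists in the kernel today.
 */
struct rcu_sync_flagged {		/* stand-in for struct rcu_sync */
	bool expedited;
	/* ...existing rcu_sync state... */
};

static void rcu_sync_wait_gp(struct rcu_sync_flagged *rsp)
{
	if (READ_ONCE(rsp->expedited))
		synchronize_rcu_expedited();	/* fast, but disruptive */
	else
		synchronize_rcu();		/* normal grace period */
}
<snip>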
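Coming back to the numbers at the top of the thread: the stock
normal-GP wait is just a completion hung off an ordinary callback, so
wakeme_after_rcu() cannot run before everything queued ahead of it,
which is what rcu_normal_wake_from_gp avoids by waking waiters right
after the grace period ends instead of queuing them behind other
callbacks. A condensed sketch of the stock path (the real code is
wait_rcu_gp() in kernel/rcu/update.c; wait_normal_gp() below is a
made-up wrapper name, and the on-stack rcu_head debug initialization
is omitted):

<snip>
struct rcu_synchronize {
	struct rcu_head head;
	struct completion completion;
};

static void wakeme_after_rcu(struct rcu_head *head)
{
	struct rcu_synchronize *rcu =
		container_of(head, struct rcu_synchronize, head);

	complete(&rcu->completion);
}

static void wait_normal_gp(void)
{
	struct rcu_synchronize rs;

	init_completion(&rs.completion);
	call_rcu(&rs.head, wakeme_after_rcu);	/* joins the CB list */
	wait_for_completion(&rs.completion);	/* e.g. behind 3612 CBs */
}
<snip>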
>
> Also, I will continue reviewing the patch. Thanks.
>
Thanks.

--
Uladzislau Rezki