On Tue, Feb 27, 2024 at 5:50 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
>
> On Tue, Feb 27, 2024 at 5:39 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> >
> > On 2/20/2024 1:31 PM, Uladzislau Rezki (Sony) wrote:
> > > A call to synchronize_rcu() can be optimized from a latency point
> > > of view. Workloads which depend on this can benefit from it.
> > >
> > > The delay of the wakeme_after_rcu() callback, which unblocks a waiter,
> > > depends on several factors:
> > >
> > > - how quickly the offloading process is started. A combination of:
> > >     - !CONFIG_RCU_NOCB_CPU/CONFIG_RCU_NOCB_CPU;
> > >     - !CONFIG_RCU_LAZY/CONFIG_RCU_LAZY;
> > >     - other.
> > > - once started, whether the invoking path is interrupted due to:
> > >     - a time limit;
> > >     - need_resched();
> > >     - a limit being reached.
> > > - where in the nocb list it is located;
> > > - how fast previous callbacks completed.
> > >
> > > Example:
> > >
> > > 1. On our embedded devices I can easily trigger a scenario where the
> > > wakeup callback is the last in a list of ~3600 callbacks:
> > >
> > > <snip>
> > > <...>-29 [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
> > > ...
> > > <...>-29 [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
> > > <...>-29 [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
> > > <...>-29 [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
> > > <...>-29 [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
> > > <...>-29 [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
> > > <...>-29 [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
> > > <...>-29 [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
> > > <snip>
> > >
> > > 2. We use cpuset/cgroup to classify tasks and assign them into
> > > different cgroups. For example, a "background" group binds tasks only
> > > to little CPUs, whereas a "foreground" group makes use of all CPUs.
> > > Tasks can be migrated between groups on request if acceleration is
> > > needed.
> >
> > [...]
> > >  * Initialize a new grace period. Return false if no grace period required.
> > >  */
> > > @@ -1432,6 +1711,7 @@ static noinline_for_stack bool rcu_gp_init(void)
> > >  	unsigned long mask;
> > >  	struct rcu_data *rdp;
> > >  	struct rcu_node *rnp = rcu_get_root();
> > > +	bool start_new_poll;
> > >
> > >  	WRITE_ONCE(rcu_state.gp_activity, jiffies);
> > >  	raw_spin_lock_irq_rcu_node(rnp);
> > > @@ -1456,10 +1736,24 @@ static noinline_for_stack bool rcu_gp_init(void)
> > >  	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > >  	rcu_seq_start(&rcu_state.gp_seq);
> > >  	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > +	start_new_poll = rcu_sr_normal_gp_init();
> > >  	trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
> > >  	rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
> > >  	raw_spin_unlock_irq_rcu_node(rnp);
> > >
> > > +	/*
> > > +	 * "start_new_poll" is set to true only when this GP is not able to
> > > +	 * handle anything and there are outstanding users. That happens when
> > > +	 * rcu_sr_normal_gp_init() was not able to insert a dummy separator
> > > +	 * into the llist, because there were no dummy-nodes left.
> > > +	 *
> > > +	 * The number of dummy-nodes is fixed, so we can run out of them. If
> > > +	 * that happens, we start a new poll request to try again. This is
> > > +	 * rare and it means that the system is processing callbacks slowly.
> > > +	 */
> > > +	if (start_new_poll)
> > > +		(void) start_poll_synchronize_rcu();
> > > +
> >
> > Optionally, does it make sense to print a warning if too many retries occurred?
>
> For future work, I was wondering about a slight modification to avoid
> this "running out of nodes" issue entirely: why not add a wait node to
> task_struct and use that? rcu_gp_init() can just use that. Then there is
> no limit on the number of callers or on the length of the list. And by
> definition, you cannot have more than one caller per task_struct. Would
> that not work?
>
> So in rcu_gp_init(), use the wait node of the first task_struct at the
> top of the list and mark it as a "special node", perhaps with a flag that
> says it is also the dummy node.
>
> But yeah, the new design of the patch is really cool... if you are
> leaving it alone without going for the above suggestion, I can add it to
> my backlog for future work.

Ouch, let me clarify, and sorry for so many messages. I meant: use the same
storage as the synchronous caller who is at the top of the list (its
rcu_synchronize node) as the "dummy" demarcation in the list. Not sure if
that makes sense or whether it will work, but my feeling is that the limit
of 5 might not be needed.
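
For reference, this is roughly how I picture the fixed pool of "dummy"
wait-heads that imposes the limit, and the exhaustion case that
start_new_poll covers. It is only an illustrative sketch of my mental
model, not the actual patch code; the names (sr_wait_node,
sr_get_wait_head) and the details are approximations:

<snip>
#include <linux/atomic.h>
#include <linux/llist.h>

#define SR_WAIT_HEAD_MAX 5

/* One reusable dummy wait-head; only a small, fixed pool of them exists. */
struct sr_wait_node {
	atomic_t inuse;
	struct llist_node node;
};

static struct sr_wait_node sr_wait_nodes[SR_WAIT_HEAD_MAX];

/* Grab a free dummy wait-head, or return NULL if the whole pool is busy. */
static struct llist_node *sr_get_wait_head(void)
{
	int i;

	for (i = 0; i < SR_WAIT_HEAD_MAX; i++) {
		if (!atomic_cmpxchg_acquire(&sr_wait_nodes[i].inuse, 0, 1))
			return &sr_wait_nodes[i].node;
	}

	/*
	 * Pool exhausted: the caller (rcu_sr_normal_gp_init()) reports
	 * start_new_poll = true, and rcu_gp_init() retries by kicking
	 * start_poll_synchronize_rcu().
	 */
	return NULL;
}
<snip>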
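
And to make the suggestion above a bit more concrete, something along
these lines is what I have in mind. Again, a very rough and untested
sketch with invented names (sr_waiter, sr_mark_demarcation), not a
drop-in replacement for anything in the patch:

<snip>
#include <linux/compiler.h>
#include <linux/completion.h>
#include <linux/kernel.h>
#include <linux/llist.h>

/* Per-caller wait node, living in the synchronize_rcu() caller's storage. */
struct sr_waiter {
	struct llist_node node;
	struct completion completion;
	bool is_gp_demarcation;	/* True if this node also marks the GP boundary. */
};

/*
 * Called from rcu_gp_init(): instead of inserting a separate dummy node,
 * tag the waiter currently at the top of the llist so it doubles as the
 * demarcation for this grace period.
 */
static void sr_mark_demarcation(struct llist_head *srs_next)
{
	struct llist_node *head = READ_ONCE(srs_next->first);
	struct sr_waiter *first;

	if (!head)
		return;

	first = container_of(head, struct sr_waiter, node);
	WRITE_ONCE(first->is_gp_demarcation, true);
}
<snip>

The GP-cleanup side would then walk the list and stop at (and wake) the
flagged waiter instead of looking for a separate dummy node, so the fixed
pool and the start_poll_synchronize_rcu() retry would no longer be needed.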