Re: [PATCH v5 2/4] rcu: Reduce synchronize_rcu() latency

On Tue, Feb 27, 2024 at 5:39 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
>
>
>
> On 2/20/2024 1:31 PM, Uladzislau Rezki (Sony) wrote:
> > A call to synchronize_rcu() can be optimized from a latency
> > point of view. Workloads which depend on this can benefit from it.
> >
> > The delay of wakeme_after_rcu() callback, which unblocks a waiter,
> > depends on several factors:
> >
> > - how fast the offloading process is started. A combination of:
> >     - !CONFIG_RCU_NOCB_CPU/CONFIG_RCU_NOCB_CPU;
> >     - !CONFIG_RCU_LAZY/CONFIG_RCU_LAZY;
> >     - other factors.
> > - once started, the invoking path may be interrupted due to:
> >     - a time limit;
> >     - need_resched();
> >     - the batch limit being reached.
> > - where in the nocb list the callback is located;
> > - how fast previous callbacks complete.
> >
> > Example:
> >
> > 1. On our embedded devices I can easily trigger a scenario where
> > the wakeup callback is the last in a list of ~3600 callbacks:
> >
> > <snip>
> >   <...>-29      [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
> > ...
> >   <...>-29      [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
> >   <...>-29      [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
> >   <...>-29      [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
> >   <...>-29      [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
> >   <...>-29      [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
> >   <...>-29      [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
> >   <...>-29      [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
> > <snip>
> >
> > 2. We use cpuset/cgroup to classify tasks and assign them to
> > different cgroups, for example a "background" group which binds tasks
> > only to little CPUs, or a "foreground" group which makes use of all
> > CPUs. Tasks can be migrated between groups on request if acceleration
> > is needed.
> >
[...]
> >   * Initialize a new grace period.  Return false if no grace period required.
> >   */
> > @@ -1432,6 +1711,7 @@ static noinline_for_stack bool rcu_gp_init(void)
> >       unsigned long mask;
> >       struct rcu_data *rdp;
> >       struct rcu_node *rnp = rcu_get_root();
> > +     bool start_new_poll;
> >
> >       WRITE_ONCE(rcu_state.gp_activity, jiffies);
> >       raw_spin_lock_irq_rcu_node(rnp);
> > @@ -1456,10 +1736,24 @@ static noinline_for_stack bool rcu_gp_init(void)
> >       /* Record GP times before starting GP, hence rcu_seq_start(). */
> >       rcu_seq_start(&rcu_state.gp_seq);
> >       ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > +     start_new_poll = rcu_sr_normal_gp_init();
> >       trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
> >       rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
> >       raw_spin_unlock_irq_rcu_node(rnp);
> >
> > +     /*
> > +      * "start_new_poll" is set to true only when this GP is not able
> > +      * to handle anything and there are outstanding users. It happens
> > +      * when rcu_sr_normal_gp_init() was not able to insert a dummy
> > +      * separator into the llist, because there were no dummy nodes left.
> > +      *
> > +      * The number of dummy nodes is fixed; if we run out of them, we
> > +      * start a new poll request to retry. This is rare and it means
> > +      * that the system is processing callbacks slowly.
> > +      */
> > +     if (start_new_poll)
> > +             (void) start_poll_synchronize_rcu();
> > +
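
To make the dummy-separator mechanism above concrete, here is a minimal
userspace sketch of the idea as I understand it. All names (sr_node,
sr_get_free_dummy(), SR_DUMMY_MAX, and the pool size) are hypothetical
stand-ins rather than the patch's actual internals, and the real code
uses a lockless llist rather than this simplified list:

<snip>
#include <stdbool.h>
#include <stddef.h>

/* A node in the list of synchronize_rcu() waiters. */
struct sr_node {
	struct sr_node *next;
	bool is_dummy;		/* true for a GP separator */
};

/* Fixed pool of dummy separator nodes (size is made up here). */
#define SR_DUMMY_MAX 5
static struct sr_node sr_dummy[SR_DUMMY_MAX];
static bool sr_dummy_busy[SR_DUMMY_MAX];

/* Head of the waiter list; new waiters are pushed here. */
static struct sr_node *sr_head;

/* Grab a free dummy node, or NULL if the pool is exhausted. */
static struct sr_node *sr_get_free_dummy(void)
{
	for (int i = 0; i < SR_DUMMY_MAX; i++) {
		if (!sr_dummy_busy[i]) {
			sr_dummy_busy[i] = true;
			sr_dummy[i].is_dummy = true;
			return &sr_dummy[i];
		}
	}
	return NULL;
}

/*
 * Called at GP start: push a dummy separator so that everything
 * currently on the list is attributed to this GP.  Returns true if
 * the caller must start a new poll request and retry, i.e. the dummy
 * pool ran dry because callback processing is slow.
 */
static bool sr_normal_gp_init(void)
{
	struct sr_node *dummy;

	if (!sr_head)
		return false;	/* no outstanding waiters */

	dummy = sr_get_free_dummy();
	if (!dummy)
		return true;	/* out of dummies: retry via new poll */

	dummy->next = sr_head;
	sr_head = dummy;
	return false;
}
<snip>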
>
> Optionally, does it make sense to print a warning if too many retries occurred?

For future work, I was wondering about a slight modification to avoid
this "running out of nodes" issue entirely: why not add a wait node to
task_struct and use that? rcu_gp_init() can just use it. Then there is
no limit on the number of callers or the length of the list. And by
definition, you cannot have more than one caller per task_struct. Would
that not work?

So in rcu_gp_init(), use the wait node of the first task_struct at the
top of the list and mark it as a "special node", perhaps with a flag
that says it is also the dummy node.
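
A rough sketch of what I mean, purely illustrative (the field and flag
names are made up, and I am eliding the locking/lockless details):

<snip>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-task wait node embedded in task_struct. */
struct sr_wait_node {
	struct sr_wait_node *next;
	bool is_dummy;		/* set when this node also acts as
				 * the GP separator ("special node") */
};

struct task_struct_sketch {
	/* ... existing task_struct fields ... */
	struct sr_wait_node rcu_sr_node;
};

static struct sr_wait_node *sr_head;

/*
 * synchronize_rcu() side: a blocked task contributes exactly one
 * node, so the list can never run out of nodes no matter how many
 * callers there are.
 */
static void sr_add_waiter(struct task_struct_sketch *tsk)
{
	tsk->rcu_sr_node.is_dummy = false;
	tsk->rcu_sr_node.next = sr_head;
	sr_head = &tsk->rcu_sr_node;
}

/*
 * rcu_gp_init() side: instead of consuming a node from a fixed pool,
 * flag the node at the top of the list as the dummy separator.
 */
static void sr_mark_separator(void)
{
	if (sr_head)
		sr_head->is_dummy = true;
}
<snip>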

But yeah, the new design of the patch is really cool... if you are
leaving it as-is without going for the above suggestion, I can add it
to my backlog for future work.

Thanks.




