On Tue, Feb 27, 2024 at 5:50 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
>
> On Tue, Feb 27, 2024 at 5:39 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> >
> > On 2/20/2024 1:31 PM, Uladzislau Rezki (Sony) wrote:
> > > A call to synchronize_rcu() can be optimized from a latency point
> > > of view. Workloads which depend on this can benefit from it.
> > >
> > > The delay of the wakeme_after_rcu() callback, which unblocks a waiter,
> > > depends on several factors:
> > >
> > > - how quickly the offloading process is started. A combination of:
> > >     - !CONFIG_RCU_NOCB_CPU/CONFIG_RCU_NOCB_CPU;
> > >     - !CONFIG_RCU_LAZY/CONFIG_RCU_LAZY;
> > >     - other.
> > > - once started, whether the invoking path is interrupted due to:
> > >     - a time limit;
> > >     - need_resched();
> > >     - a limit being reached.
> > > - where in the nocb list it is located;
> > > - how fast previous callbacks completed.
> > >
> > > Example:
> > >
> > > 1. On our embedded devices I can easily trigger a scenario where the
> > > wakeup callback is the last in a list of ~3600 callbacks:
> > >
> > > <snip>
> > > <...>-29 [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
> > > ...
> > > <...>-29 [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
> > > <...>-29 [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
> > > <...>-29 [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
> > > <...>-29 [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
> > > <...>-29 [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
> > > <...>-29 [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
> > > <...>-29 [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
> > > <snip>
> > >
> > > 2. We use cpuset/cgroup to classify tasks and assign them into
> > > different cgroups. For example, a "background" group binds tasks only
> > > to little CPUs, whereas a "foreground" group makes use of all CPUs.
> > > Tasks can be migrated between groups on request if acceleration is
> > > needed.
> >
> > [...]
> > >  * Initialize a new grace period. Return false if no grace period required.
> > >  */
> > > @@ -1432,6 +1711,7 @@ static noinline_for_stack bool rcu_gp_init(void)
> > >  	unsigned long mask;
> > >  	struct rcu_data *rdp;
> > >  	struct rcu_node *rnp = rcu_get_root();
> > > +	bool start_new_poll;
> > >
> > >  	WRITE_ONCE(rcu_state.gp_activity, jiffies);
> > >  	raw_spin_lock_irq_rcu_node(rnp);
> > > @@ -1456,10 +1736,24 @@ static noinline_for_stack bool rcu_gp_init(void)
> > >  	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > >  	rcu_seq_start(&rcu_state.gp_seq);
> > >  	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > +	start_new_poll = rcu_sr_normal_gp_init();
> > >  	trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
> > >  	rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
> > >  	raw_spin_unlock_irq_rcu_node(rnp);
> > >
> > > +	/*
> > > +	 * "start_new_poll" is set to true only when this GP is not able to
> > > +	 * handle anything and there are outstanding users. That happens when
> > > +	 * rcu_sr_normal_gp_init() was not able to insert a dummy separator
> > > +	 * into the llist, because there were no dummy-nodes left.
> > > +	 *
> > > +	 * The number of dummy-nodes is fixed, so we can run out of them. If
> > > +	 * that happens, we start a new poll request to try again. This is
> > > +	 * rare and it means that the system is processing callbacks slowly.
> > > +	 */
> > > +	if (start_new_poll)
> > > +		(void) start_poll_synchronize_rcu();
> > > +
> >
> > Optionally, does it make sense to print a warning if too many retries occurred?
>
> For future work, I was wondering about a slight modification to avoid
> this "running out of nodes" issue entirely: why not add a wait node to
> task_struct and use that? rcu_gp_init() can just use that. Then there is
> no limit on the number of callers or on the length of the list. And by
> definition, you cannot have more than one caller per task_struct. Would
> that not work?
>
> So in rcu_gp_init(), use the wait node of the first task_struct at the
> top of the list and mark it as a "special node", perhaps with a flag that
> says it is also the dummy node.
>
> But yeah, the new design of the patch is really cool... if you are
> leaving it alone without going for the above suggestion, I can add it to
> my backlog for future work.

Ouch, let me clarify, and sorry for so many messages. I meant: use the same
storage as the synchronous caller who is at the top of the list (its
rcu_synchronize node) as the "dummy" demarcation in the list. Not sure if
that makes sense or whether it will work, but my feeling is that the limit
of 5 might not be needed.
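
For reference, this is roughly how I picture the fixed pool of "dummy"
wait-heads that imposes the limit, and the exhaustion case that
start_new_poll covers. It is only an illustrative sketch of my mental
model, not the actual patch code; the names (sr_wait_node,
sr_get_wait_head) and the details are approximations:

<snip>
#include <linux/atomic.h>
#include <linux/llist.h>

#define SR_WAIT_HEAD_MAX 5

/* One reusable dummy wait-head; only a small, fixed pool of them exists. */
struct sr_wait_node {
	atomic_t inuse;
	struct llist_node node;
};

static struct sr_wait_node sr_wait_nodes[SR_WAIT_HEAD_MAX];

/* Grab a free dummy wait-head, or return NULL if the whole pool is busy. */
static struct llist_node *sr_get_wait_head(void)
{
	int i;

	for (i = 0; i < SR_WAIT_HEAD_MAX; i++) {
		if (!atomic_cmpxchg_acquire(&sr_wait_nodes[i].inuse, 0, 1))
			return &sr_wait_nodes[i].node;
	}

	/*
	 * Pool exhausted: the caller (rcu_sr_normal_gp_init()) reports
	 * start_new_poll = true, and rcu_gp_init() retries by kicking
	 * start_poll_synchronize_rcu().
	 */
	return NULL;
}
<snip>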
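
And to make the suggestion above a bit more concrete, something along
these lines is what I have in mind. Again, a very rough and untested
sketch with invented names (sr_waiter, sr_mark_demarcation), not a
drop-in replacement for anything in the patch:

<snip>
#include <linux/compiler.h>
#include <linux/completion.h>
#include <linux/kernel.h>
#include <linux/llist.h>

/* Per-caller wait node, living in the synchronize_rcu() caller's storage. */
struct sr_waiter {
	struct llist_node node;
	struct completion completion;
	bool is_gp_demarcation;	/* True if this node also marks the GP boundary. */
};

/*
 * Called from rcu_gp_init(): instead of inserting a separate dummy node,
 * tag the waiter currently at the top of the llist so it doubles as the
 * demarcation for this grace period.
 */
static void sr_mark_demarcation(struct llist_head *srs_next)
{
	struct llist_node *head = READ_ONCE(srs_next->first);
	struct sr_waiter *first;

	if (!head)
		return;

	first = container_of(head, struct sr_waiter, node);
	WRITE_ONCE(first->is_gp_demarcation, true);
}
<snip>

The GP-cleanup side would then walk the list and stop at (and wake) the
flagged waiter instead of looking for a separate dummy node, so the fixed
pool and the start_poll_synchronize_rcu() retry would no longer be needed.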