Re: [PATCH v2 1/3] rcu: Reduce synchronize_rcu() latency

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Wed, 8 Nov 2023 07:24:07 -0800

On Wed, Nov 08, 2023 at 11:56:25AM +0100, Uladzislau Rezki wrote:
> > > 
> > > Do you have something that can easily trigger it? I mean some proposal
> > > or steps to test. Probably i should try what you wrote, regarding
> > > toggling from user space.
> > > 
> > > > I can imagine ways around this, but they are a bit ugly.  They end
> > > > up being things like recording a timestamp on every sysfs change to
> > > > rcu_normal, and then using that timestamp to deduce whether there could
> > > > possibly have been sysfs activity on rcu_normal in the meantime.
> > > > 
> > > > It feels like it should be so easy...  ;-)
> > > > 
> > > Hmm.. Yes it requires more deep analysis :)
> > 
> > Maybe make that WARN_ONCE() condition also test a separate Kconfig
> > option that depends on both DEBUG_KERNEL and RCU_EXPERT?
> > 
> Do you mean to introduce a new Kconfig? For example CONFIG_DEBUG_SRS:
> 
> <snip>
> config DEBUG_SRS
>         bool "Provide debugging asserts for normal synchronize_rcu() call"
>         depends on DEBUG_KERNEL && RCU_EXPERT
>         help
> 	   ...
> <snip>

Yes, in kernel/rcu/Kconfig.debug.  But please use a more self-explanatory
name, keeping in mind that Kconfig options are a global namespace.
Please at least have an "RCU" in there somewhere.  ;-)
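
For illustration only, something along these lines.  The option name
below is hypothetical, just one way of getting "RCU" into the name, and
the help text is a placeholder rather than proposed wording:

```
# Hypothetical name and help text, for illustration only.
config RCU_DEBUG_NORMAL_SYNC
	bool "Provide debugging asserts for normal synchronize_rcu() call"
	depends on DEBUG_KERNEL && RCU_EXPERT
	help
	  Placeholder: enables the extra WARN_ONCE() check discussed
	  earlier in this thread.
```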

> > > > > I was thinking about read_lock()/write_lock() since we have many readers
> > > > > and only one writer. But i do not really like it either.
> > > > 
> > > > This might be a hint that we should have multiple lists, perhaps one
> > > > per CPU.  Or lock contention could be used to trigger the transition
> > > > from a single list to multiple lists, as is done in SRCU and tasks RCU.
> > > >
> > > I do not consider a sync call to be as heavily used as other callbacks,
> > > which require several workers to handle, IMHO. On the other hand, my
> > > experiments show that handling 60K-100K by NOCB gives even worse results.
> > > 
> > > > 
> > > > But I bet that there are several ways to make things work.
> > > > 
> > > Right. The main concern with read_lock()/write_lock() is PREEMPT_RT
> > > kernels, where it is an rt-mutex. It would be good to avoid any
> > > blocking in the gp-kthread since it is the gp driver.
> > 
> > RCU is pretty low-level, so it is OK with a raw spinlock for the list
> > manipulation.  But only the list manipulation itself.  Perhaps you are
> > worried about lock contention, but in that case, there is also the issue
> > of memory contention for the llist code.
> > 
> I do not consider a lock nor memory contention as an issue here. Whereas
> blocking on rt-mutex in the gp-kthread i consider as "not good to go with".
> raw-spinlocks are OK, but it is a per-cpu or per-node approach which i tend
> to avoid, if not, then probably per-cpu-or-node and merge everything into
> one llist to offload by one worker.

If you have a large enough system and a high enough rate of calls to
synchronize_rcu(), something is going to break.  The current llist
approach will suffer from memory contention and high cache-miss rate,
thus also suffering excessive CPU consumption.  A sleeplock will (as you
say) suffer from excessive blocking.  A spinlock (including raw spinlocks
on PREEMPT_RT) will suffer from excessive spinning and CPU consumption.

Which is why this optimization must continue to be default-off.

I agree that a change to multiple queues, perhaps up to per-CPU queueing,
would be needed to eliminate the possibility of these problems and thus
(hopefully!) make this safe for being a default-on option.  It might
even need to be dynamic, as for SRCU and Tasks RCU.  But neither of
these more-complex options needs to be implemented in the initial version.
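
To make the multiple-queue idea a bit more concrete, here is a rough
userspace sketch, not kernel code and not a proposed patch.  It assumes
C11 atomics as a stand-in for the kernel's llist primitives, and every
name in it (NR_SHARDS, struct sync_node, shard_push(), merge_all()) is
made up for illustration.  Each shard is a lock-free stack, and a single
drainer detaches all shards and splices them into one list, which is
roughly the "per-cpu-or-node and merge everything into one llist" shape
mentioned above:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the number of CPUs or nodes. */
#define NR_SHARDS 4

struct sync_node {
	struct sync_node *next;
	int id;
};

struct shard {
	_Atomic(struct sync_node *) head;
};

static struct shard shards[NR_SHARDS];

/* Lock-free push onto one shard, in the spirit of llist_add(). */
static void shard_push(unsigned int cpu, struct sync_node *n)
{
	struct shard *s = &shards[cpu % NR_SHARDS];
	struct sync_node *old;

	assert(n != NULL);
	old = atomic_load(&s->head);
	do {
		/* On CAS failure, "old" is refreshed and we retry. */
		n->next = old;
	} while (!atomic_compare_exchange_weak(&s->head, &old, n));
}

/*
 * Detach each shard in one shot, in the spirit of llist_del_all(),
 * and splice everything into a single list for one worker to handle.
 */
static struct sync_node *merge_all(void)
{
	struct sync_node *merged = NULL;

	for (unsigned int i = 0; i < NR_SHARDS; i++) {
		struct sync_node *n = atomic_exchange(&shards[i].head, NULL);

		while (n) {
			struct sync_node *next = n->next;

			n->next = merged;
			merged = n;
			n = next;
		}
	}
	return merged;
}

/* Helper for checking the result: count the nodes on a list. */
static size_t list_len(const struct sync_node *n)
{
	size_t len = 0;

	for (; n; n = n->next)
		len++;
	return len;
}
```

Spreading the pushes across shards is what reduces contention on any
one list head.  The merge step walks every node, so this sketch says
nothing about whether the drainer itself would keep up at high rates.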

							Thanx, Paul