On Tue, Dec 08, 2020 at 06:52:30PM +0100, Frederic Weisbecker wrote:
> On Tue, Dec 08, 2020 at 09:19:27AM -0800, Paul E. McKenney wrote:
> > On Tue, Dec 08, 2020 at 04:54:57PM +0100, Frederic Weisbecker wrote:
> > > On Tue, Dec 08, 2020 at 06:58:10AM -0800, Paul E. McKenney wrote:
> > > > Hello, Frederic,
> > > >
> > > > Boqun just asked if RCU callbacks ran in BH-disabled context to avoid
> > > > concurrent execution of the same callback.  Of course, this raises the
> > > > question of whether a self-posting callback can have two instances of
> > > > itself running concurrently while a CPU is in the process of
> > > > transitioning between softirq and rcuo invocation of callbacks.
> > > >
> > > > I believe that the answer is "no" because BH-disabled context is
> > > > an implicit RCU read-side critical section.  Therefore, the initial
> > > > invocation of the RCU callback must complete in order for a new grace
> > > > period to complete, and a new grace period must complete before the
> > > > second invocation of that same callback can start.
> > > >
> > > > Does that make sense, or am I missing something?
> > >
> > > Sounds like a good explanation.  But then why are we actually calling
> > > the entire rcu_do_batch() under BH-disabled context?  Was it to quieten
> > > lockdep against rcu_callback_map?
> >
> > Inertia and lack of complaints about it.  ;-)
> >
> > Plus when called from softirq, neither local_bh_disable() nor
> > rcu_read_lock() is necessary, and so represents pointless overhead.
> >
> > > Wouldn't rcu_read_lock() around callback invocation be enough?  Or is
> > > there another reason for the BH-disabled context that I'm missing?
> >
> > There are likely to be callback functions that use spin_lock() instead
> > of spin_lock_bh() because they know that they are invoked in BH-disabled
> > context.
>
> Ah right.  So perhaps we can keep local_bh_disable() instead.
>
> > But how does this change help?
>
> It reduces the scope of the code running with BH disabled.  Narrowing
> that scope down also helps clarify what the BH disabling actually
> protects.

I thought that you would call out unnecessarily delaying other softirq
handlers.  ;-)

But if such delays are a problem (and they might well be), then avoiding
them on non-rcu_nocb CPUs would instead/also require adding a check for
other pending softirqs to the existing early-exit checks involving time,
need_resched, and idle.  At which point, entering and exiting BH-disabled
context again doesn't help, other than your point about the difference in
BH-disabled scopes on rcu_nocb and non-rcu_nocb CPUs.

Would it make sense to exit rcu_do_batch() if more than some amount of
time had elapsed and there was some non-RCU softirq pending?  My guess
is that the current tlimit checks in rcu_do_batch() make this
unnecessary.

Thoughts?

							Thanx, Paul
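
A minimal sketch of the self-posting case discussed at the top of the
thread.  The names self_post_cb and self_post_rh are hypothetical; only
call_rcu() and struct rcu_head are real kernel interfaces:

	#include <linux/rcupdate.h>

	static struct rcu_head self_post_rh;

	static void self_post_cb(struct rcu_head *rhp)
	{
		/*
		 * This callback runs with BH disabled, which is an
		 * implicit RCU read-side critical section.  The grace
		 * period started by the call_rcu() below therefore
		 * cannot complete until this function returns, so the
		 * next instance of self_post_cb() cannot start while
		 * this one is still running.
		 */
		call_rcu(rhp, self_post_cb);
	}

The first invocation would be kicked off by a call_rcu(&self_post_rh,
self_post_cb) from some other context.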
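
A sketch of the spin_lock()-versus-spin_lock_bh() hazard mentioned
above, assuming a hypothetical callback and timer handler sharing one
lock (my_data, my_rcu_cb, and my_timer_fn are made-up names):

	#include <linux/kernel.h>
	#include <linux/rcupdate.h>
	#include <linux/spinlock.h>
	#include <linux/timer.h>

	struct my_data {
		spinlock_t lock;
		struct rcu_head rh;
		int val;
	};

	static struct my_data gd;

	/*
	 * RCU callback.  Plain spin_lock() is safe only because
	 * rcu_do_batch() invokes callbacks with BH disabled, which
	 * keeps my_timer_fn() from interrupting us on this CPU.
	 */
	static void my_rcu_cb(struct rcu_head *rhp)
	{
		struct my_data *p = container_of(rhp, struct my_data, rh);

		spin_lock(&p->lock);
		p->val = 0;
		spin_unlock(&p->lock);
	}

	/* Softirq-context user (timer handler) of the same lock. */
	static void my_timer_fn(struct timer_list *unused)
	{
		spin_lock(&gd.lock);
		gd.val++;
		spin_unlock(&gd.lock);
	}

If rcu_do_batch() on an rcuo kthread ran callbacks under a bare
rcu_read_lock(), the timer softirq could fire while my_rcu_cb() holds
gd.lock, and my_timer_fn() would then spin forever on that CPU.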
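
And a rough sketch of the hypothetical early exit, placed next to the
existing tlimit logic in the callback-invocation loop of rcu_do_batch().
The locals rcl and tlimit and the helpers local_clock(),
local_softirq_pending(), and RCU_SOFTIRQ are real, but the exact
placement and the softirq masking here are guesses, not a proposed
patch:

	if (unlikely(tlimit)) {
		/* Existing behavior: consult the clock only every 32 callbacks. */
		if (likely(-rcl.len & 31))
			continue;
		/* Existing behavior: exceeded the time limit, so leave. */
		if (local_clock() >= tlimit)
			break;
		/*
		 * Hypothetical addition: some non-RCU softirq is
		 * pending, so leave early rather than delay its
		 * handler any further.
		 */
		if (local_softirq_pending() & ~(1U << RCU_SOFTIRQ))
			break;
	}

In the spirit of the question above, the pending-softirq check could
additionally be gated on some elapsed-time threshold shorter than
tlimit, rather than applying from the first clock check onward.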