Re: Normal RCU grace period can be stalled for long because need-resched flags not set?

"Paul E. McKenney" <paulmck@xxxxxxxxxxxxx> · Sat, 6 Jul 2019 11:21:17 -0700

On Sat, Jul 06, 2019 at 08:02:30AM -0400, Joel Fernandes wrote:
> On Thu, Jul 04, 2019 at 03:17:02PM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 04, 2019 at 02:50:55PM -0400, Joel Fernandes wrote:
> [snip] 
> > > I did have a request, could you help me understand why is the grace period
> > > duration double that of my busy wait time? You mentioned this has something
> > > to do with the thread not waking up before another GP is started. But I did
> > > not follow this. Thanks a lot!!
> > 
> > Let's look at a normal grace period, and assume the ->gp_seq grace-period
> > sequence counter is initially 0x1, so that a grace period is already
> > in progress (keep in mind that the low-order two bits of ->gp_seq are
> > the phase within the grace period and the rest of the bits are the
> > grace-period number, so we are in phase 1 of the grace period following
> > grace period #0).  This grace period was started via synchronize_rcu()
> > by Task A.
> > 
> > Then we have the following sequence of events:
> > 
> > o	Task B does call_rcu(), but is too late to use the already-started
> > 	grace period, so it needs another one.	The ->gp_seq_needed
> > 	counter thus becomes 0x8.
> > 
> > o	The current grace period completes, so that the ->gp_seq
> > 	counter becomes 0x4.  Callback invocation commences.
> > 
> > o	The grace-period kthread notices that ->gp_seq_needed is greater
> > 	than ->gp_seq, so it starts a new grace period.  The ->gp_seq
> > 	counter therefore becomes 0x5.
> > 
> > o	The callback for Task A's synchronize_rcu() is invoked, awakening
> > 	Task A.  This happens almost immediately after the new grace
> > 	period starts, but does definitely happen -after- that grace
> > 	period starts.
> 
> Yes, but at this point, we are still at the 1GP mark. But the distribution's
> median below shows a strong correlation with 2 preempt-disable durations and
> grace-period completion. For example, if the duration of the preempt
> disable/enable loop is 50ms, it strongly shows the writer's median time
> difference before and after the synchronize_rcu as 100ms, not really showing
> it is 60 ms or 75 ms or anything.
> 
> > 
> > o	Task A, being part of rcuperf, almost immediately does another
> > 	synchronize_rcu().  So ->gp_seq_needed becomes 0xc.
> 
> Yes, but before that another synchronize_rcu, it also gets the timestamp and
> does a diff between old/new timestamp and has captured the data, so at that
> point the data captured should only be for around 1GPs worth give or take.

But in this case, the "give or take" is almost a full grace period, for
a total of almost two grace periods.

> > If you play out the rest of this sequence, you should see how Task A
> > waits for almost two full grace periods.
> 
> I tried to play this out, still didn't get it :-|. Not that it is an issue
> per-se, but still would like to understand it better.

Let's try it a different way.  Let's skip the state portion of ->gp_seq
and just focus on the number of elapsed grace periods.

o	Let's start with RCU idle, having completed grace-period number
	zero and not yet having started grace-period number 1.
	Let's also abbreviate: GP0 and GP1.

o	Task A wants a grace period.  Because RCU is idle, Task A's
	GP will complete at completion of GP1.

o	RCU starts GP1.

o	Almost immediately thereafter, Task B wants a grace period.
	But it cannot use GP1, because GP1 has already started, even
	if just barely, because some CPU or tasks might already have
	reported a quiescent state for GP1.

	So Task B's GP will not complete until the end of GP2.	And that
	is almost two GPs worth of time from now.

And on a system where RCU is busy, Task B's experience is the common
case if it is in a tight loop doing synchronize_rcu().  Because RCU is
busy, by the end of each grace period, there will always be a request
for another grace period.  And therefore RCU's grace-period kthread
will immediately start a new grace period upon completion of each
grace period.  So the sequence of events from Task B's viewpoint will
be as follows:

o	Task B executes synchronize_rcu(), which requests a new grace
	period.

o	This requested grace period ends, and another immediately starts.

o	Task B's RCU callback is invoked, which does a wakeup.

o	Task B executes another synchronize_rcu(), but just after a new
	grace period has started.  Task B thus has to wait almost two
	full grace periods.

This could be perceived as a problem, but my question back to you would
be "Why on earth are you invoking synchronize_rcu() in a tight loop???"
"Oh, because you are running rcuperf?  Well, that is why rcuperf uses
ftrace data!"

Which might be the point on which we are talking past each other.  You
might have been thinking of the latency of RCU's grace-period computation.
I was thinking of the length of time between the call to synchronize_rcu()
and the corresponding return.

Does that help, or am I missing your point?

							Thanx, Paul

> > > Performance changes in consolidated vs regular
> > > -------------------------------------------
> > > I ran a thread on a reserved CPU doing preempt disable + busy wait + preempt enable
> > > in a loop and measured the difference in rcuperf between conslidated and regular.
> > > nreaders = nwriters = 10.
> > > 
> > > 		(preempt disable duration)
> > > 		5ms	10ms	20ms	50ms
> > > v4.19
> > > median (usecs)	12000.3	12001	11000	12000
> > > 
> > > v5.1 (deferred)
> > > median (usecs)	13000	19999   40000   100000
> > > 
> > > All of this is still within spec of RCU.
> 
> thanks!
> 
>  - Joel
>