Re: Normal RCU grace period can be stalled for long because need-resched flags not set?

On Wed, Jul 03, 2019 at 11:25:20AM -0400, Joel Fernandes wrote:
> Hi!
> I am measuring performance of consolidated RCU vs. RCU before the
> consolidation of flavors happened (just for fun, and maybe to talk
> about in a presentation).
> 
> What I did is I limited the readers/writers in rcuperf to run on all
> but one CPU. And then on that one CPU, I had a thread doing a
> preempt-disable + busy-wait + preempt_enable in a loop.
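
A minimal sketch of such a loop (hypothetical names, fixed 5 ms busy-wait;
not the actual rcuperf change) might look like this:

        #include <linux/kthread.h>
        #include <linux/delay.h>
        #include <linux/preempt.h>

        static int busy_wait_fn(void *arg)
        {
                while (!kthread_should_stop()) {
                        preempt_disable();
                        mdelay(5);              /* busy-wait ~5 ms with preemption off */
                        preempt_enable();       /* reschedules here if need-resched is
                                                 * set (CONFIG_PREEMPT=y) */
                }
                return 0;
        }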

In a CONFIG_PREEMPT=y kernel?  (Guessing so because otherwise
preempt_enable() doesn't do all that much.)

Ah, and CONFIG_NO_HZ_FULL has an effect as well.

> I was hoping the preempt disable busy-wait thread would stall the
> regular readers, and it did.
> But what I noticed is that grace periods take 100-200 milliseconds to
> finish instead of the busy-wait time of 5-10 ms that I set. On closer
> examination, it looks like even though the preempt_enable happens in
> my loop, the need-resched flag is not set even though the grace period
> is long overdue. So the thread does not reschedule.

The 100 milliseconds is expected behavior if there is not much of
anything else runnable on the busy-wait CPU, at least in recent
kernels.  So which kernel are you running?  ;-)

And on the need-resched flag not being set, is it possible that it was
set, but was cleared before you looked at it?  After all, the grace
period did end, which means that there was some sort of quiescent state
on the busy-waiting CPU.  And one quiescent state would be a pass
through the scheduler, which would clear the need-resched flag.

> For now, in my test I am just setting the need-resched flag manually
> after a busy wait.
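
A guess at that workaround (setting the flag on the busy-waiting task by
hand; not necessarily the exact code used) would be:

        preempt_disable();
        mdelay(5);                              /* busy-wait */
        set_tsk_need_resched(current);          /* manually flag the task... */
        set_preempt_need_resched();             /* ...and fold it into the preempt count */
        preempt_enable();                       /* now takes the reschedule path */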

Or are you saying that without your setting need-resched, you are getting
RCU CPU stall warnings?  Depending on exactly what you have in your
busy-wait loop, that might be expected behavior for CONFIG_PREEMPT=n
kernels.

> But I was thinking, can this really happen in real life? So, say a CPU
> is doing a lot of work in preempt_disable but is diligent enough to
> check the need-resched flag periodically. I believe some spin-on-owner
> type locking primitives do this.
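
The pattern in question, loosely modeled on the mutex spin-on-owner code
(owner_running() is shown here only as a simplified stand-in), is roughly:

        /* Spin only while the lock owner is running and we are not needed elsewhere. */
        while (owner_running(lock, owner)) {
                if (need_resched())
                        break;          /* stop spinning, fall back to the sleeping path */
                cpu_relax();
        }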

I believe that RCU handles this correctly.  Of course, after detecting
need-resched, the code must do something that allows the scheduler to
take appropriate action.  One approach is to simply call cond_resched()
periodically, which conveniently combines the need-resched check with
the transfer of control to the scheduler.
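
For example, a long-running kernel loop (do_one_unit_of_work() is a
hypothetical helper) would typically look like:

        for (i = 0; i < nr_items; i++) {
                do_one_unit_of_work(i);         /* hypothetical per-item work */
                cond_resched();                 /* need-resched check plus a call
                                                 * into the scheduler when needed */
        }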

> Even though the thread is stalling the grace period, it has no clue
> because no one told it that a GP is in progress that is being held up.
> In the tick interrupt for that thread, rcu_need_deferred_qs() returns
> false during the preempt-disable section. Can we do better for such
> use cases, for example by sending an IPI to the CPUs holding up the
> grace period? Or even upgrading the grace period to an expedited one
> if need be?

The tick interrupt will invoke rcu_sched_clock_irq(), which should take
care of things.  Unless this is a CONFIG_NO_HZ_FULL=y kernel, in which case a
CPU running in the kernel might never take a scheduling-clock interrupt.
The RCU grace-period kthread checks for this and takes appropriate action
in rcu_implicit_dynticks_qs().

> Expedited grace periods did not have such issues. However, I did notice
> that sometimes the grace period would end not within 1 busy-wait
> duration but within 2. The distribution was strongly bimodal, with
> peaks at 1*busy-wait and 2*busy-wait durations for expedited tests.
> (This expedited test actually happened by accident, because the
> preempt-disable in my loop was delaying init enough that the whole
> test was running during init, when synchronize_rcu() is upgraded
> to expedited.)

I could imagine all sorts of ways that this might happen, but use of
event tracing or ftrace or trace_printk() might be a good next step here.
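
For example, hypothetical trace_printk() instrumentation around the
grace-period wait in the test would show where the time is going (the
output lands in the ftrace buffer):

        ktime_t t0 = ktime_get();

        synchronize_rcu();
        trace_printk("GP took %lld us\n",
                     ktime_us_delta(ktime_get(), t0));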

> I am sorry if this is not a realistic real-life problem, but more a
> "doctor it hurts if I do this" problem as Steven once said ;-)

Within the kernel, there are rules that you are supposed to follow, such
as cond_resched() or similar within long-running loops.  If you break
those rules, stop doing that.  Otherwise, RCU is supposed to handle it.
Within userspace, anything goes, and RCU is supposed to handle it.
Give or take random writes to /dev/mem and similar, anyway.

> I'll keep poking ;-)

Very good!

							Thanx, Paul



