Re: RCU stalls with TREE07 on v6.0 kernel

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Sat, 31 Dec 2022 08:07:55 -0800

On Sat, Dec 31, 2022 at 12:28:10AM -0500, Joel Fernandes wrote:
> On Fri, Dec 30, 2022 at 9:14 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >
> > On Fri, Dec 30, 2022 at 08:42:24PM -0500, Joel Fernandes wrote:
> > > Hello,
> > >
> > > I have been firefighting a hang on 6.0.y stable kernels with
> > > rcutorture. It happens mostly consistently when TREE07 is shutting
> > > down.
> > >
> > > It appears that the RCU torture threads are attempted to stop but the
> > > shutdown thread, but are constantly awakened by a timer softirq
> > > handler in ksoftirqd context. When they wake up, they immediately goto
> > > sleep in uninterruptible state until the next time a timer handler
> > > wakes them up. It appears the timer softirq is long enough to cause
> > > RCU stalls and I see it calling 100s of timer function handlers
> > > (call_timer_fn).
> > >
> > > I am doing some more investigation with trace_printk(s):
> > > https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=stable/trace-hang-6.0.y&id=b779b1e92c97f29333a282ee8f548da02f64de2b
> > >
> > > Regarding the timer handlers, I was wondering if it is possible that a
> > > large number of timer handlers constantly queued can cause RCU stalls
> > > due to the timer softirq taking a very long time. That certainly
> > > appears to be the case here. Shouldn't the timer softirq also do
> > > rcu_softirq_qs() similar to the ksoftirq loop, in case there are too
> > > many of them?
> >
> > That is certainly a good thing to try!
> 
> I am trying something like this just for testing, let's see what happens ;-)
> 
> @@ -1788,9 +1796,14 @@ static inline void __run_timers(struct timer_base *base)
> 
>                 while (levels--)
>                         expire_timers(base, heads + levels);

Might you need to put it inside the body of this "while" loop?

> +
> +               rcu_softirq_qs();
>         }
> 
> I guess I am also wondering why the rcu reader does not stop queuing
> timers. It is doing schedule_timeout_interruptible() constantly even
> though the test is being stopped.

As in it is doing schedule_timeout_interruptible() in a loop that does
not check torture_must_stop()?  That would be bad unless there is some
reason the loop cannot just stop.  But if the loop cannot just stop,
that leads me to believe that there is a bug elsewhere.  After all, given
that the test is ending, why shouldn't all the loops be able to just stop?
(Maybe it is waiting on a buggy child task, for example.)

							Thanx, Paul