Re: RCU stalls with TREE07 on v6.0 kernel

> On Dec 31, 2022, at 11:52 AM, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> 
> On Sat, Dec 31, 2022 at 04:18:40PM +0000, Joel Fernandes wrote:
>>> On Sat, Dec 31, 2022 at 4:16 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
>>> 
>>> On Sat, Dec 31, 2022 at 4:07 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>>>> 
>>>> On Sat, Dec 31, 2022 at 12:28:10AM -0500, Joel Fernandes wrote:
>>>>> On Fri, Dec 30, 2022 at 9:14 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>>>>>> 
>>>>>> On Fri, Dec 30, 2022 at 08:42:24PM -0500, Joel Fernandes wrote:
>>>>>>> Hello,
>>>>>>> 
>>>>>>> I have been firefighting a hang on 6.0.y stable kernels with
>>>>>>> rcutorture. It happens mostly consistently when TREE07 is shutting
>>>>>>> down.
>>>>>>> 
>>>>>>> It appears that the RCU torture threads are attempting to stop at the
>>>>>>> request of the shutdown thread, but are constantly awakened by a timer
>>>>>>> softirq handler in ksoftirqd context. When they wake up, they
>>>>>>> immediately go back to sleep in uninterruptible state until the next
>>>>>>> time a timer handler wakes them up. It appears the timer softirq runs
>>>>>>> long enough to cause RCU stalls, and I see it calling hundreds of
>>>>>>> timer function handlers (call_timer_fn).
>>>>>>> 
>>>>>>> I am doing some more investigation with trace_printk(s):
>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=stable/trace-hang-6.0.y&id=b779b1e92c97f29333a282ee8f548da02f64de2b
>>>>>>> 
>>>>>>> Regarding the timer handlers, I was wondering if it is possible that a
>>>>>>> large number of timer handlers constantly queued can cause RCU stalls
>>>>>>> due to the timer softirq taking a very long time. That certainly
>>>>>>> appears to be the case here. Shouldn't the timer softirq also do
>>>>>>> rcu_softirq_qs() similar to the ksoftirqd loop, in case there are too
>>>>>>> many of them?
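>>>>>>> 
>>>>>>> (For reference, the ksoftirqd path already does roughly the following
>>>>>>> in __do_softirq(); this is a from-memory sketch of kernel/softirq.c,
>>>>>>> not an exact quote of the tree:)
>>>>>>> 
>>>>>>>         while ((softirq_bit = ffs(pending))) {
>>>>>>>                 ...
>>>>>>>                 h->action(h);   /* Run one softirq handler. */
>>>>>>>                 ...
>>>>>>>         }
>>>>>>> 
>>>>>>>         /* Report an RCU quiescent state when running in ksoftirqd. */
>>>>>>>         if (!IS_ENABLED(CONFIG_PREEMPT_RT) &&
>>>>>>>             __this_cpu_read(ksoftirqd) == current)
>>>>>>>                 rcu_softirq_qs();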
>>>>>> 
>>>>>> That is certainly a good thing to try!
>>>>> 
>>>>> I am trying something like this just for testing, let's see what happens ;-)
>>>>> 
>>>>> @@ -1788,9 +1796,14 @@ static inline void __run_timers(struct timer_base *base)
>>>>> 
>>>>>                while (levels--)
>>>>>                        expire_timers(base, heads + levels);
>>>> 
>>>> Might you need to put it inside the body of this "while" loop?
>>> 
>>> I could, but even doing it as below did not help. Do you see any reason
>>> why putting it inside the loop would help? Let me know.
>>> 
>>>>> +               rcu_softirq_qs();
>>>>>        }
>>>>> 
>>>>> I guess I am also wondering why the RCU reader does not stop queuing
>>>>> timers. It is doing schedule_timeout_interruptible() constantly even
>>>>> though the test is being stopped.
>>>> 
>>>> As in it is doing schedule_timeout_interruptible() in a loop that does
>>>> not check torture_must_stop()?  That would be bad unless there is some
>>>> reason the loop cannot just stop.  But if the loop cannot just stop,
>>>> that leads me to believe that there is a bug elsewhere.  After all, given
>>>> that the test is ending, why shouldn't all the loops be able to just stop?
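>>>> 
>>>> For the record, such a loop is typically shaped something like the
>>>> following (a sketch of the usual pattern, not the exact rcutorture
>>>> code):
>>>> 
>>>>         do {
>>>>                 /* ... do one iteration of torture work ... */
>>>>                 schedule_timeout_interruptible(1);
>>>>         } while (!torture_must_stop());
>>>>         torture_kthread_stopping("rcu_torture_reader");
>>>>         return 0;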
>>> 
>>> Sorry, I meant uninterruptible sleep, that is,
>>> schedule_timeout_uninterruptible(). But I still cannot nail down where
>>> in the torture test it is coming from. So now I put a trace_dump_stack()
>>> directly in __schedule() and I am hoping it will show the culprit:
>>> 
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index c2bb0cd4225d5..d79bc3b91818a 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -6390,6 +6390,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>>>   *
>>>   * WARNING: must be called with preemption disabled!
>>>   */
>>> +static int rcu_tort_ds;
>>> +
>>>  static void __sched notrace __schedule(unsigned int sched_mode)
>>>  {
>>>         struct task_struct *prev, *next;
>>> @@ -6452,6 +6454,11 @@ static void __sched notrace __schedule(unsigned int sched_mode)
>>>         if (prev->sched_contributes_to_load)
>>>                 rq->nr_uninterruptible++;
>>> +       /* Dump a stack for every 10th uninterruptible sleep. */
>>> +       if ((prev_state & TASK_UNINTERRUPTIBLE) && ((rcu_tort_ds++ % 10) == 0)) {
>>> +               trace_dump_stack(0);
>>> +       }
>>> +
>>>         /*
>> 
>> OK, found the uninterruptible sleep. It is coming from here:
>> 
>> [ 1359.592979] rcu_tort-160      15d.... 346177086us : <stack trace>
>> [ 1359.592979]  => __ftrace_trace_stack
>> [ 1359.592979]  => __schedule
>> [ 1359.592979]  => schedule
>> [ 1359.592979]  => schedule_timeout
>> [ 1359.592979]  => torture_kthread_stopping
>> [ 1359.592979]  => rcu_torture_reader
>> [ 1359.592979]  => kthread
>> [ 1359.592979]  => ret_from_fork
>> 
>> Thoughts?
> 
> That is door #3, or a variation on it.  That rcu_torture_reader() task
> is waiting to be stopped in torture_kthread_stopping().  The next question
> is "why isn't rcu_torture_cleanup() making it to the 'if (reader_tasks)'
> code that stops the readers?"
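> 
> For context, torture_kthread_stopping() is the holding pattern that each
> kthread spins in until the cleanup code finally calls kthread_stop() on
> it, roughly (a from-memory sketch of kernel/torture.c, not an exact
> quote):
> 
>         void torture_kthread_stopping(char *title)
>         {
>                 char buf[128];
> 
>                 snprintf(buf, sizeof(buf), "torture_%s task stopping", title);
>                 VERBOSE_TOROUT_STRING(buf);
>                 while (!kthread_should_stop()) {
>                         torture_shutdown_absorb(title);
>                         schedule_timeout_uninterruptible(1);
>                 }
>         }
> 
> Note the schedule_timeout_uninterruptible(1), which matches the stack
> trace above.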
> 
> Getting a stack trace of the task executing rcu_torture_cleanup() would
> likely be helpful.  Or just check for the console-log messages that it
> already emits.  These look like "Stopping rcu_torture_writer task".

Thanks for the suggestions, will look into them. I will focus the debugging on the kthread-stopping part. I think we might have to check for signal flags if we are not doing so already; I found another patch that solved a similar issue that way, and in that case too the problem was a tight loop around the schedule_timeout() calls.
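
The pattern I have in mind is roughly the following (a hypothetical sketch of the usual kthread idiom, not the actual patch I found):

        while (!kthread_should_stop()) {
                schedule_timeout_interruptible(1);
                /*
                 * If a signal is left pending, every subsequent
                 * schedule_timeout_interruptible() returns immediately,
                 * turning this into a tight loop; flushing the signals
                 * restores the intended sleep.
                 */
                if (signal_pending(current))
                        flush_signals(current);
        }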

Happy new year! Over here I am probably going to take it easy the rest of the day, but this issue is keeping me curious ;-)

Thanks,

 - Joel 


> 
>                            Thanx, Paul



