On Sun, Jan 1, 2023 at 12:16 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>
> On Sat, Dec 31, 2022 at 06:10:40PM -0500, Joel Fernandes wrote:
> > On Sat, Dec 31, 2022 at 4:49 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> > >
> > > On Sat, Dec 31, 2022 at 11:46 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > [...]
> > > > Hmmm... Some of the tasks run at relatively high priority. Maybe they
> > > > need to de-prioritize themselves before looping waiting to be stopped.
> > > > These loops look like this:
> > > >
> > > >         while (!kthread_should_stop()) {
> > > >                 torture_shutdown_absorb("rcu_torture_boost");
> > > >                 schedule_timeout_uninterruptible(1);
> > > >         }
> > >
> > > Yes, it appears this tight loop is live locked with the timer softirq.
> > > I am trying a run with higher timeout to see if it helps.
> > >
> > > >
> > > > Or it might be something else...
> > >
> > > I see that kthread_should_stop() returns false, but
> > > torture_must_stop_irq() returns true in the tight while loop mentioned
> > > above. So it seems like the shutdown notifier triggered first. I am
> > > seeing various "is stopping" messages. However I see no "End-test"
> > > messages, which means I think the torture_shutdown_hook() never ran
> > > properly, or something. Anyway now I am doing heavy tracing in
> > > rcu_torture_cleanup() to see what it is up to. My suspicion is it did
> > > not even call torture_stop_kthread() and we are stuck without the
> > > kthreads being stopped.
> >
> > Now all tests pass always if I do the following change in torture_stopping():
> >
> > -       schedule_timeout_uninterruptible(1);
> > +       schedule_timeout_uninterruptible(50);
> >
> > Current theory is, the timer softirq preempts the cleanup thread
> > before it can call kthread_stop().
> >
> > Anyway, let me know if this is an acceptable change (or not). I think
> > checking for shutdown state 20 times per second instead of 1000 times
> > per second is kind of reasonable.
>
> Make that schedule_timeout_uninterruptible(HZ / 20) and sold!

Thanks! I'll send a patch shortly.

I think I nailed it correctly. The problem is that fullstop is set to
FULLSTOP_RMMOD before kthread_stop() is called. This causes all the
rcutorture threads to enter the tight loop in kthread_stopping().
Further, this can happen in a thundering-herd fashion, with every
thread constantly queueing timers, causing the timer softirq to stall a
writer which just happened to be executing synchronize.

I just did 100 runs with HZ/20 and all pass. Patch on the way.

Thanks,

 - Joel