Re: [PATCH 0/3] rcu: Add RCU stall diagnosis information

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Thu, 20 Oct 2022 16:13:53 -0700

On Mon, Oct 17, 2022 at 06:01:05PM +0800, Zhen Lei wrote:
> In some extreme cases, such as the I/O pressure test, the CPU usage may
> be 100%, causing an RCU stall. In this case, the printed information about
> current is not useful. Displaying the number and usage of hard interrupts,
> soft interrupts, and context switches generated within half of the
> CPU stall timeout can help us make a general judgment. In other
> cases, we can preliminarily determine whether an infinite loop occurs
> while local_irq, local_bh, or preempt is disabled.
> 
> Zhen Lei (3):
>   sched: Add helper kstat_cpu_softirqs_sum()
>   sched: Add helper nr_context_switches_cpu()
>   rcu: Add RCU stall diagnosis information

Interesting approach, thank you!

I have pulled this in for testing and review, having rescued it from my
spam folder.

Some questions that might come up include:  (1) Can the addition of
things like cond_resched() make RCU happier with the I/O pressure test?
(2) Should there be a way to turn this off for environments with slow
consoles?  (3) If this information shows heavy CPU usage, what debug
and fix approach should be used?

For an example of #1, if a CPU is flooded with softirq activity, one
might hope that the call to rcu_softirq_qs() would prevent the RCU CPU
stall warning, at least for kernels built with CONFIG_PREEMPT_RT=n.
Similarly, if there are huge numbers of context switches, one might hope
that rcu_note_context_switch() would report a quiescent state sooner
rather than later.
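
To make the discussion concrete, here is a minimal user-space sketch of
how per-CPU counter deltas like those described in the cover letter
might be read. It is not the code from this series. The struct, the
field names, the enum, and the triage order are all hypothetical, and
the only assumption taken from the cover letter is that counters are
sampled partway through the stall timeout and compared again when the
stall is reported.

```c
#include <assert.h>

/* Hypothetical model of per-CPU counters; names are illustrative only. */
struct cpu_snap {
	unsigned long nr_hardirqs;	/* hard interrupts handled so far */
	unsigned long nr_softirqs;	/* softirq handlers run so far */
	unsigned long nr_csw;		/* context switches so far */
};

/* Hypothetical triage results, not part of any kernel API. */
enum stall_hint {
	HINT_IRQS_OFF,		/* no hardirqs: loop with irqs disabled? */
	HINT_BH_OFF,		/* hardirqs but no softirqs: bh disabled? */
	HINT_PREEMPT_OFF,	/* no context switches: preempt disabled? */
	HINT_BUSY,		/* everything advancing: just heavy load? */
};

/* Difference between two snapshots; unsigned math tolerates wraparound. */
static struct cpu_snap snap_delta(const struct cpu_snap *then,
				  const struct cpu_snap *now)
{
	struct cpu_snap d;

	assert(then && now);
	d.nr_hardirqs = now->nr_hardirqs - then->nr_hardirqs;
	d.nr_softirqs = now->nr_softirqs - then->nr_softirqs;
	d.nr_csw = now->nr_csw - then->nr_csw;
	return d;
}

/* Crude first guess from a delta; a real diagnosis needs more context. */
static enum stall_hint classify(const struct cpu_snap *d)
{
	if (d->nr_hardirqs == 0)
		return HINT_IRQS_OFF;
	if (d->nr_softirqs == 0)
		return HINT_BH_OFF;
	if (d->nr_csw == 0)
		return HINT_PREEMPT_OFF;
	return HINT_BUSY;
}
```

In this model, the HINT_BUSY case is the one that questions #1 and #3
are about: interrupts, softirqs, and context switches are all advancing,
so the counters alone do not say why no quiescent state was reported.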

Thoughts?

							Thanx, Paul

>  include/linux/kernel_stat.h | 12 +++++++++++
>  kernel/rcu/tree.h           | 11 ++++++++++
>  kernel/rcu/tree_stall.h     | 40 +++++++++++++++++++++++++++++++++++++
>  kernel/sched/core.c         |  5 +++++
>  4 files changed, 68 insertions(+)
> 
> -- 
> 2.25.1
>