On Mon, Oct 17, 2022 at 06:01:05PM +0800, Zhen Lei wrote:
> In some extreme cases, such as the I/O pressure test, the CPU usage may
> be 100%, causing RCU stall. In this case, the printed information about
> current is not useful. Displays the number and usage of hard interrupts,
> soft interrupts, and context switches that are generated within half of
> the CPU stall timeout, can help us make a general judgment. In other
> cases, we can preliminarily determine whether an infinite loop occurs
> when local_irq, local_bh or preempt is disabled.
>
> Zhen Lei (3):
>   sched: Add helper kstat_cpu_softirqs_sum()
>   sched: Add helper nr_context_switches_cpu()
>   rcu: Add RCU stall diagnosis information

Interesting approach, thank you! I have pulled this in for testing and
review, having rescued it from my spam folder.

Some questions that might come up include: (1) Can the addition of
things like cond_resched() make RCU happier with the I/O pressure test?
(2) Should there be a way to turn this off for environments with slow
consoles? (3) If this information shows heavy CPU usage, what debug
and fix approach should be used?

For an example of #1, if a CPU is flooded with softirq activity, one
might hope that the call to rcu_softirq_qs() would prevent the RCU CPU
stall warning, at least for kernels built with CONFIG_PREEMPT_RT=n.
Similarly, if there are huge numbers of context switches, one might hope
that the rcu_note_context_switch() would report a quiescent state sooner
rather than later.

Thoughts?

							Thanx, Paul

>  include/linux/kernel_stat.h | 12 +++++++++++
>  kernel/rcu/tree.h           | 11 ++++++++++
>  kernel/rcu/tree_stall.h     | 40 +++++++++++++++++++++++++++++++++++++
>  kernel/sched/core.c         |  5 +++++
>  4 files changed, 68 insertions(+)
>
> --
> 2.25.1
>