On 2022/11/17 6:55, Frederic Weisbecker wrote: > On Fri, Nov 11, 2022 at 09:07:08PM +0800, Zhen Lei wrote: >> +1. A CPU looping with interrupts disabled.:: >> + >> + rcu: hardirqs softirqs csw/system >> + rcu: number: 0 0 0 >> +65;6003;1c rcu: cputime: 0 0 0 ==> 2500(ms) >> + >> + Because interrupts have been disabled throughout the measurement >> + interval, there are no interrupts and no context switches. >> + Furthermore, because CPU time consumption was measured using interrupt >> + handlers, the system CPU consumption is misleadingly measured as zero. >> + This scenario will normally also have "(0 ticks this GP)" printed on >> + this CPU's summary line. >> + >> +2. A CPU looping with bottom halves disabled. >> + >> + This is similar to the previous example, but with non-zero number of >> + and CPU time consumed by hard interrupts, along with non-zero CPU >> + time consumed by in-kernel execution.:: >> + >> + rcu: hardirqs softirqs csw/system >> + rcu: number: 624 0 0 >> + rcu: cputime: 49 0 2446 ==> 2500(ms) >> + >> + The fact that there are zero softirqs gives a hint that these were >> + disabled, perhaps via local_bh_disable(). It is of course possible >> + that there were no softirqs, perhaps because all events that would >> + result in softirq execution are confined to other CPUs. In this case, >> + the diagnosis should continue as shown in the next example. >> + >> +3. A CPU looping with preemption disabled. >> + >> + Here, only the number of context switches is zero.:: >> + >> + rcu: hardirqs softirqs csw/system >> + rcu: number: 624 45 0 >> + rcu: cputime: 69 1 2425 ==> 2500(ms) >> + >> + This situation hints that the stalled CPU was looping with preemption >> + disabled. >> + >> +4. No looping, but massive hard and soft interrupts.:: >> + >> + rcu: hardirqs softirqs csw/system >> + rcu: number: xx xx 0 >> + rcu: cputime: xx xx 0 ==> 2500(ms) >> + >> + Here, the number and CPU time of hard interrupts are all non-zero, >> + but the number of context switches and the in-kernel CPU time consumed >> + are zero. The number and cputime of soft interrupts will usually be >> + non-zero, but could be zero, for example, if the CPU was spinning >> + within a single hard interrupt handler. >> + >> + If this type of RCU CPU stall warning can be reproduced, you can >> + narrow it down by looking at /proc/interrupts or by writing code to >> + trace each interrupt, for example, by referring to show_interrupts(). > > One last question I have. Usually all these informations can be deduced by > just looking at the stacktrace that comes along an RCU stall report. So on > which kind of situation the stacktrace is not enough? Interrupt storm. > > Thanks. > . > -- Regards, Zhen Lei