Describe how to quickly determine the RCU stall fault type from the extra
information printed when CONFIG_RCU_CPU_STALL_CPUTIME=y.

Signed-off-by: Zhen Lei <thunder.leizhen@xxxxxxxxxx>
---
 Documentation/RCU/stallwarn.rst | 56 +++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/Documentation/RCU/stallwarn.rst b/Documentation/RCU/stallwarn.rst
index dfa4db8c0931eaf..40748bff8b8186e 100644
--- a/Documentation/RCU/stallwarn.rst
+++ b/Documentation/RCU/stallwarn.rst
@@ -390,3 +390,59 @@ for example, "P3421".
 
 It is entirely possible to see stall warnings from normal and from
 expedited grace periods at about the same time during the same run.
+
+RCU_CPU_STALL_CPUTIME
+=====================
+If CONFIG_RCU_CPU_STALL_CPUTIME=y or rcupdate.rcu_cpu_stall_cputime=1,
+some additional statistics about interrupts and tasks are printed,
+as follows:
+rcu:          hardirqs   softirqs   csw/system
+rcu:  number:      624         45            0
+rcu: cputime:       69          1         2425   ==> 2500(ms)
+
+These statistics are collected during the second half of the RCU stall
+timeout. The values in row "number:" are the number of hard interrupts,
+number of soft interrupts, and number of context switches. The values in
+row "cputime:" are the cputime of hard interrupts, cputime of soft
+interrupts, cputime of tasks, and sampling period. Because user-mode tasks
+do not cause RCU stalls, these tasks can only be kernel tasks, which is
+why only system cputime is considered.
+
+The following describes four typical scenarios:
+1. A CPU looping with interrupts disabled.
+   rcu:          hardirqs   softirqs   csw/system
+   rcu:  number:        0          0            0
+   rcu: cputime:        0          0            0   ==> 2500(ms)
+   The start time of interrupt processing is marked when the handler is
+   entered, and the end time is marked when the handler exits. The cputime
+   of hard interrupts is zero because the time spent in the current
+   interrupt has not yet been accounted. Since interrupts are disabled, all
+   other counts must be zero in the second half of the RCU stall timeout.
+
+2. A CPU looping with bottom halves disabled.
+   Similar to the previous case, but the number and cputime of hard
+   interrupts are non-zero.
+   rcu:          hardirqs   softirqs   csw/system
+   rcu:  number:      624          0            0
+   rcu: cputime:       49          0         2446   ==> 2500(ms)
+   The cputime of system is non-zero, so local_bh_disable() was called in
+   the current task. Otherwise, the cputime of softirqs would be non-zero.
+   Note that in this case the number of soft interrupts is always zero.
+
+3. A CPU looping with preemption disabled.
+   The number and cputime of hard interrupts and soft interrupts are all
+   non-zero. Only the number of context switches is zero.
+   rcu:          hardirqs   softirqs   csw/system
+   rcu:  number:      624         45            0
+   rcu: cputime:       69          1         2425   ==> 2500(ms)
+
+4. No looping, but massive hard and soft interrupts.
+   rcu:          hardirqs   softirqs   csw/system
+   rcu:  number:       xx         xx            0
+   rcu: cputime:       xx         xx            0   ==> 2500(ms)
+   The number and cputime of hard interrupts are both non-zero. The number
+   of context switches and the cputime of system are zero. The number and
+   cputime of soft interrupts depend on the cputime of hard interrupts, and
+   are either both zero or both non-zero.
+   If the stall can be reproduced, cat /proc/interrupts or write code to
+   trace each interrupt by referring to show_interrupts().
-- 
2.25.1
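
For illustration only, and not part of the patch: the four scenarios in the
added documentation amount to a small decision table over the "number:" and
"cputime:" rows. A minimal Python sketch of that table follows. The function
names, the returned labels, and the regular expression are hypothetical
helpers invented here; the sketch assumes the row layout shown in the patch
and returns "unknown" for anything the four scenarios do not cover.

```python
import re

# Matches the "number:" and "cputime:" rows; the header row has no digits
# after a "number:"/"cputime:" label, so it is skipped.
ROW = re.compile(r"rcu:\s+(number|cputime):\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_stall_rows(log):
    """Return {"number": (hard, soft, csw), "cputime": (hard, soft, system)}."""
    rows = {}
    for m in ROW.finditer(log):
        rows[m.group(1)] = tuple(int(v) for v in m.group(2, 3, 4))
    return rows


def classify_stall(log):
    """Map the extra stall output to one of the four documented scenarios."""
    rows = parse_stall_rows(log)
    if "number" not in rows or "cputime" not in rows:
        return "unknown"
    n_hard, n_soft, n_csw = rows["number"]
    _t_hard, _t_soft, t_sys = rows["cputime"]
    if n_csw != 0:
        return "unknown"  # every documented scenario has zero context switches
    if n_hard == 0:
        return "looping with interrupts disabled"       # scenario 1
    if t_sys == 0:
        return "massive hard and soft interrupts"       # scenario 4
    if n_soft == 0:
        return "looping with bottom halves disabled"    # scenario 2
    return "looping with preemption disabled"           # scenario 3


if __name__ == "__main__":
    sample = """\
rcu:          hardirqs   softirqs   csw/system
rcu:  number:      624          0            0
rcu: cputime:       49          0         2446   ==> 2500(ms)
"""
    print(classify_stall(sample))
```

The checks run from the most restrictive condition (no hard interrupts at
all) to the least, so each sample output from the patch falls into exactly
one branch.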