On Thu, Nov 10, 2022 at 03:29:04PM +0800, Leizhen (ThunderTown) wrote:
>
>
> On 2022/11/10 1:03, Frederic Weisbecker wrote:
> > On Wed, Nov 09, 2022 at 07:59:01AM -0800, Paul E. McKenney wrote:
> >> On Wed, Nov 09, 2022 at 04:26:21PM +0100, Frederic Weisbecker wrote:
> >>> Hi Zhen Lei,
> >>>
> >>> On Wed, Nov 09, 2022 at 05:37:36PM +0800, Zhen Lei wrote:
> >>>> v5 --> v6:
> >>>> 1. When there are more than two continuous RCU stallings, correctly handle the
> >>>>    value of the second and subsequent sampling periods. Update comments and
> >>>>    document.
> >>>>    Thanks to Elliott, Robert for the test.
> >>>> 2. Change "rcu stall" to "RCU stall".
> >>>>
> >>>> v4 --> v5:
> >>>> 1. Resolve a git am conflict. No code change.
> >>>>
> >>>> v3 --> v4:
> >>>> 1. Rename rcu_cpu_stall_deep_debug to rcu_cpu_stall_cputime.
> >>>>
> >>>> v2 --> v3:
> >>>> 1. Fix the return type of kstat_cpu_irqs_sum()
> >>>> 2. Add Kconfig option CONFIG_RCU_CPU_STALL_DEEP_DEBUG and boot parameter
> >>>>    rcupdate.rcu_cpu_stall_deep_debug.
> >>>> 3. Add comments and normalize local variable name
> >>>>
> >>>>
> >>>> v1 --> v2:
> >>>> 1. Fixed a bug in the code. If the rcu stall is detected by another CPU,
> >>>>    kcpustat_this_cpu cannot be used.
> >>>> @@ -451,7 +451,7 @@ static void print_cpu_stat_info(int cpu)
> >>>>         if (r->gp_seq != rdp->gp_seq)
> >>>>                 return;
> >>>>
> >>>> -       cpustat = kcpustat_this_cpu->cpustat;
> >>>> +       cpustat = kcpustat_cpu(cpu).cpustat;
> >>>> 2. Move the start point of statistics from rcu_stall_kick_kthreads() to
> >>>>    rcu_implicit_dynticks_qs(), removing the dependency on irq_work.
> >>>>
> >>>> v1:
> >>>> In some extreme cases, such as the I/O pressure test, the CPU usage may
> >>>> be 100%, causing RCU stall. In this case, the printed information about
> >>>> current is not useful. Displays the number and usage of hard interrupts,
> >>>> soft interrupts, and context switches that are generated within half of
> >>>> the CPU stall timeout, can help us make a general judgment. In other
> >>>> cases, we can preliminarily determine whether an infinite loop occurs
> >>>> when local_irq, local_bh or preempt is disabled.
> >>>
> >>> That looks useful but I have to ask: what does it bring that the softlockup
> >>> and hardlockup watchdog can not already solve?
> >>
> >> This is a good point. One possible benefit is putting the needed information
> >> in one spot, for example, in cases where the soft/hard lockup timeouts are
> >> significantly different than the RCU CPU stall warning timeout.
> >
> > Arguably, the hardlockup/softlockup detectors usually trigger after RCU stall,
> > unless all CPUs are caught into a hardlockup, in which case only the hardlockup
> > detector has a chance.
>
> But not all ARCHs support hardlockup, such as s390. Maybe arm64.
>
> config HARDLOCKUP_DETECTOR
>         bool "Detect Hard Lockups"
>         depends on DEBUG_KERNEL && !S390
>         depends on HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_ARCH

Ah fair point indeed. Thanks!