On 2023/11/21 09:29, Mark Brown wrote:
> On Tue, Nov 21, 2023 at 08:58:40AM +0800, Chengming Zhou wrote:
>> On 2023/11/21 02:49, Mark Brown wrote:
>>> On Thu, Nov 02, 2023 at 03:23:27AM +0000, chengming.zhou@xxxxxxxxx wrote:
>
>>> When we see problems we see RCU stalls while logging in, for example:
>
>>> [ 46.453323] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
>>> [ 46.459361] rcu: 3-...0: (1 GPs behind) idle=def4/1/0x40000000 softirq=1304/1304 fqs=951
>>> [ 46.467669] rcu: (detected by 0, t=2103 jiffies, g=1161, q=499 ncpus=4)
>>> [ 46.474472] Sending NMI from CPU 0 to CPUs 3:
>
>> IIUC, here should print the backtrace of CPU 3, right? It looks like CPU 3 is the cause,
>> but we couldn't see what it's doing from the log.
>
> AIUI yes, but it looks like we've just completely lost the CPU - there's
> more attempts to talk to it visible in the log:
>
>>> A full log for that run can be seen at:
>>>
>>> https://validation.linaro.org/scheduler/job/4017095
>
> but none of them appear to cause CPU 3 to respond. Note that 32 bit ARM
> is just using a regular IPI rather than something that's actually a NMI
> so this isn't hugely out of the ordinary, I'd guess it's stuck with
> interrupts masked.

Ah yes, there is no NMI on ARM, so CPU 3 may be running somewhere with
interrupts disabled. I searched the full log, but still don't have a clue.
There isn't any WARNING or BUG related to SLUB in the log either.

I wonder how to reproduce it locally with a QEMU VM, since I don't have
an ARM machine.

Thanks!