On 11/21/23 19:21, Mark Brown wrote:
> On Tue, Nov 21, 2023 at 11:47:26PM +0800, Chengming Zhou wrote:
>
>> Ah yes, there is no NMI on ARM, so CPU 3 maybe running somewhere with
>> interrupts disabled. I searched the full log, but still haven't a clue.
>> And there is no any WARNING or BUG related to SLUB in the log.
>
> Yeah, nor anything else particularly. I tried turning on some debug
> options:
>
> CONFIG_SOFTLOCKUP_DETECTOR=y
> CONFIG_DETECT_HUNG_TASK=y
> CONFIG_WQ_WATCHDOG=y
> CONFIG_DEBUG_PREEMPT=y
> CONFIG_DEBUG_LOCKING=y
> CONFIG_DEBUG_ATOMIC_SLEEP=y
>
> https://validation.linaro.org/scheduler/job/4017828
>
> which has some additional warnings related to clock changes but AFAICT
> those come from today's -next rather than the debug stuff:
>
> https://validation.linaro.org/scheduler/job/4017823
>
> so that's not super helpful.

For the record (and to help debugging focus) on IRC we discussed that with
CONFIG_SLUB_CPU_PARTIAL=n the problem persists:

https://validation.linaro.org/scheduler/job/4017863

Which limits the scope of where to look so that's good :)

>> I wonder how to reproduce it locally with a Qemu VM since I don't have
>> the ARM machine.
>
> There's sample qemu jobs available from for example KernelCI:
>
> https://storage.kernelci.org/next/master/next-20231120/arm/multi_v7_defconfig/gcc-10/lab-baylibre/baseline-qemu_arm-virt-gicv3.html
>
> (includes the command line, though it's not using Debian testing like my
> test was). Note that I'm testing a bunch of platforms with the same
> kernel/rootfs combination and it was only the Raspberry Pi 3 which blew
> up. It is a bit tight for memory which might have some influence?
>
> I'm really suspecting this may have made some underlying platform bug
> more obvious :/