Michael Ellerman <mpe@xxxxxxxxxxxxxx> writes:
> "Paul E. McKenney" <paulmck@xxxxxxxxxx> writes:
>> On Wed, Apr 06, 2022 at 05:31:10PM +0800, Zhouyi Zhou wrote:
>>> Hi
>>>
>>> I can reproduce it in a ppc virtual cloud server provided by Oregon
>>> State University. Following is what I do:
>>> 1) curl -l https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/snapshot/linux-5.18-rc1.tar.gz
>>> -o linux-5.18-rc1.tar.gz
>>> 2) tar zxf linux-5.18-rc1.tar.gz
>>> 3) cp config linux-5.18-rc1/.config
>>> 4) cd linux-5.18-rc1
>>> 5) make vmlinux -j 8
>>> 6) qemu-system-ppc64 -kernel vmlinux -nographic -vga none -no-reboot
>>> -smp 2 (QEMU 4.2.1)
>>> 7) after 12 rounds, the bug got reproduced:
>>> (http://154.223.142.244/logs/20220406/qemu.log.txt)
>>
>> Just to make sure, are you both seeing the same thing? Last I knew,
>> Zhouyi was chasing an RCU-tasks issue that appears only in kernels
>> built with CONFIG_PROVE_RCU=y, which Miguel does not have set. Or did
>> I miss something?
>>
>> Miguel is instead seeing an RCU CPU stall warning where RCU's grace-period
>> kthread slept for three milliseconds, but did not wake up for more than
>> 20 seconds. This kthread would normally have awakened on CPU 1, but
>> CPU 1 looks to me to be very unhealthy, as can be seen in your console
>> output below (but maybe my idea of what is healthy for powerpc systems
>> is outdated). Please see also the inline annotations.
>>
>> Thoughts from the PPC guys?
>
> I haven't seen it in my testing. But using Miguel's config I can
> reproduce it seemingly on every boot.
>
> For me it bisects to:
>
> 35de589cb879 ("powerpc/time: improve decrementer clockevent processing")
>
> Which seems plausible.
>
> Reverting that on mainline makes the bug go away.
>
> I don't see an obvious bug in the diff, but I could be wrong, or the old
> code was papering over an existing bug?
>
> I'll try and work out what it is about Miguel's config that exposes
> this vs our defconfig, that might give us a clue.

It's CONFIG_HIGH_RES_TIMERS=n which triggers the stall.

I can reproduce just with:

  $ make ppc64le_guest_defconfig
  $ ./scripts/config -d HIGH_RES_TIMERS

We have no defconfigs that disable HIGH_RES_TIMERS, I didn't even
realise you could disable it TBH :)

The Rust CI has it disabled because I copied that from the x86 defconfig
they were using back when I added the Rust support. I think that was
meant to be a stripped down fast config for CI, but the result is it's
just using a badly tested combination which is not helpful.

So I'll send a patch to turn HIGH_RES_TIMERS on for the Rust CI, and we
can debug this further without blocking them.

cheers
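
For anyone checking whether their own build falls into the
HIGH_RES_TIMERS=n case, here is a minimal sketch of a shell helper that
inspects a kernel .config. The function name high_res_timers_state is
hypothetical and not part of the kernel tree; it only relies on the
usual .config conventions ("CONFIG_FOO=y" for enabled, "# CONFIG_FOO is
not set" for disabled).

```shell
#!/bin/sh
# Hypothetical helper (not part of the kernel tree): report the state of
# CONFIG_HIGH_RES_TIMERS in a kernel .config file.
# Prints "enabled", "disabled", or "unset". Defaults to ./.config.
high_res_timers_state() {
    config_file="${1:-.config}"

    # Bail out early if there is no config file to look at.
    if [ ! -f "$config_file" ]; then
        echo "no such config file: $config_file" >&2
        return 1
    fi

    if grep -q '^CONFIG_HIGH_RES_TIMERS=y$' "$config_file"; then
        echo enabled
    elif grep -q '^# CONFIG_HIGH_RES_TIMERS is not set$' "$config_file"; then
        # This is the form a disabled option takes in .config.
        echo disabled
    else
        # The symbol does not appear in the file at all.
        echo unset
    fi
}
```

Calling high_res_timers_state with no argument from the top of a
configured tree reads ./.config; a result of "disabled" means the build
uses the combination that triggers the stall described above.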