On Sat, Apr 09, 2022 at 12:42:39AM +1000, Michael Ellerman wrote:
> Michael Ellerman <mpe@xxxxxxxxxxxxxx> writes:
> > "Paul E. McKenney" <paulmck@xxxxxxxxxx> writes:
> >> On Wed, Apr 06, 2022 at 05:31:10PM +0800, Zhouyi Zhou wrote:
> >>> Hi
> >>>
> >>> I can reproduce it in a ppc virtual cloud server provided by Oregon
> >>> State University. Following is what I do:
> >>> 1) curl -l https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/snapshot/linux-5.18-rc1.tar.gz
> >>> -o linux-5.18-rc1.tar.gz
> >>> 2) tar zxf linux-5.18-rc1.tar.gz
> >>> 3) cp config linux-5.18-rc1/.config
> >>> 4) cd linux-5.18-rc1
> >>> 5) make vmlinux -j 8
> >>> 6) qemu-system-ppc64 -kernel vmlinux -nographic -vga none -no-reboot
> >>> -smp 2 (QEMU 4.2.1)
> >>> 7) after 12 rounds, the bug got reproduced:
> >>> (http://154.223.142.244/logs/20220406/qemu.log.txt)
> >>
> >> Just to make sure, are you both seeing the same thing? Last I knew,
> >> Zhouyi was chasing an RCU-tasks issue that appears only in kernels
> >> built with CONFIG_PROVE_RCU=y, which Miguel does not have set. Or did
> >> I miss something?
> >>
> >> Miguel is instead seeing an RCU CPU stall warning where RCU's grace-period
> >> kthread slept for three milliseconds, but did not wake up for more than
> >> 20 seconds. This kthread would normally have awakened on CPU 1, but
> >> CPU 1 looks to me to be very unhealthy, as can be seen in your console
> >> output below (but maybe my idea of what is healthy for powerpc systems
> >> is outdated). Please see also the inline annotations.
> >>
> >> Thoughts from the PPC guys?
> >
> > I haven't seen it in my testing. But using Miguel's config I can
> > reproduce it seemingly on every boot.
> >
> > For me it bisects to:
> >
> >   35de589cb879 ("powerpc/time: improve decrementer clockevent processing")
> >
> > Which seems plausible.
> >
> > Reverting that on mainline makes the bug go away.
> >
> > I don't see an obvious bug in the diff, but I could be wrong, or the old
> > code was papering over an existing bug?
> >
> > I'll try and work out what it is about Miguel's config that exposes
> > this vs our defconfig, that might give us a clue.
>
> It's CONFIG_HIGH_RES_TIMERS=n which triggers the stall.
>
> I can reproduce just with:
>
>   $ make ppc64le_guest_defconfig
>   $ ./scripts/config -d HIGH_RES_TIMERS
>
> We have no defconfigs that disable HIGH_RES_TIMERS, I didn't even
> realise you could disable it TBH :)
>
> The Rust CI has it disabled because I copied that from the x86 defconfig
> they were using back when I added the Rust support. I think that was
> meant to be a stripped down fast config for CI, but the result is it's
> just using a badly tested combination which is not helpful.
>
> So I'll send a patch to turn HIGH_RES_TIMERS on for the Rust CI, and we
> can debug this further without blocking them.

Would it make sense to select HIGH_RES_TIMERS from one of the Kconfig*
files in arch/powerpc? Asking for a friend. ;-)

							Thanx, Paul
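
For anyone wanting to check whether a given build falls into the combination reported above (HIGH_RES_TIMERS disabled), here is a minimal sketch. The helper name hres_timers_enabled is hypothetical and not part of the kernel tree, and it assumes the usual .config convention that an enabled bool option appears as a line reading CONFIG_HIGH_RES_TIMERS=y.

```shell
# Hypothetical helper (an illustration, not a kernel script): succeeds
# if the given .config enables HIGH_RES_TIMERS, fails otherwise. A
# disabled or absent option both count as "not enabled".
hres_timers_enabled() {
    grep -q '^CONFIG_HIGH_RES_TIMERS=y$' "$1"
}
```

It could be used as "hres_timers_enabled .config || echo 'HIGH_RES_TIMERS is off'" before booting a test kernel, so that a stall seen under this config is not mistaken for an unrelated problem.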