On Fri, Apr 08, 2022 at 06:02:19PM +0800, Zhouyi Zhou wrote:
> On Fri, Apr 8, 2022 at 3:23 PM Michael Ellerman <mpe@xxxxxxxxxxxxxx> wrote:
> >
> > "Paul E. McKenney" <paulmck@xxxxxxxxxx> writes:
> > > On Wed, Apr 06, 2022 at 05:31:10PM +0800, Zhouyi Zhou wrote:
> > >> Hi
> > >>
> > >> I can reproduce it in a ppc virtual cloud server provided by Oregon
> > >> State University. Following is what I do:
> > >> 1) curl -l https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/snapshot/linux-5.18-rc1.tar.gz
> > >> -o linux-5.18-rc1.tar.gz
> > >> 2) tar zxf linux-5.18-rc1.tar.gz
> > >> 3) cp config linux-5.18-rc1/.config
> > >> 4) cd linux-5.18-rc1
> > >> 5) make vmlinux -j 8
> > >> 6) qemu-system-ppc64 -kernel vmlinux -nographic -vga none -no-reboot
> > >> -smp 2 (QEMU 4.2.1)
> > >> 7) after 12 rounds, the bug got reproduced:
> > >> (http://154.223.142.244/logs/20220406/qemu.log.txt)
> > >
> > > Just to make sure, are you both seeing the same thing? Last I knew,
> > > Zhouyi was chasing an RCU-tasks issue that appears only in kernels
> > > built with CONFIG_PROVE_RCU=y, which Miguel does not have set. Or did
> > > I miss something?
> > >
> > > Miguel is instead seeing an RCU CPU stall warning where RCU's grace-period
> > > kthread slept for three milliseconds, but did not wake up for more than
> > > 20 seconds. This kthread would normally have awakened on CPU 1, but
> > > CPU 1 looks to me to be very unhealthy, as can be seen in your console
> > > output below (but maybe my idea of what is healthy for powerpc systems
> > > is outdated). Please see also the inline annotations.
> > >
> > > Thoughts from the PPC guys?
> >
> > I haven't seen it in my testing. But using Miguel's config I can
> > reproduce it seemingly on every boot.
> >
> > For me it bisects to:
> >
> > 35de589cb879 ("powerpc/time: improve decrementer clockevent processing")
> >
> > Which seems plausible.
> I also bisect to 35de589cb879 ("powerpc/time: improve decrementer
> clockevent processing")

Very good! Thank you all!!!

							Thanx, Paul

> > Reverting that on mainline makes the bug go away.
> I also revert that on the mainline, and am currently doing a pressure
> test (by repeatedly invoking qemu and checking the console.log) on PPC
> VM in Oregon State University.
> >
> > I don't see an obvious bug in the diff, but I could be wrong, or the old
> > code was papering over an existing bug?
> >
> > I'll try and work out what it is about Miguel's config that exposes
> > this vs our defconfig, that might give us a clue.
> Great job!
> >
> > cheers
> Thanks
> Zhouyi
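
The pressure test mentioned above (repeatedly invoking qemu and checking the console output) could be sketched roughly as follows. This is only an illustrative sketch, not the script actually used in this thread: the helper names `has_rcu_stall` and `run_rounds`, the round count, and the per-boot timeout are all assumptions, and the pattern matches only the usual "self-detected stall on CPU" and "detected stalls on CPUs/tasks" wording of RCU CPU stall warnings. The QEMU command line is the one quoted in step 6.

```python
import re
import subprocess

# Wording used by RCU CPU stall warnings on the console; an RCU-tasks
# problem like the one seen with CONFIG_PROVE_RCU=y would need another pattern.
STALL_RE = re.compile(
    r"rcu.*(self-detected stall on CPU|detected stalls on CPUs/tasks)"
)


def has_rcu_stall(console_text):
    """Return True if the console text contains an RCU CPU stall warning."""
    return STALL_RE.search(console_text) is not None


def run_rounds(cmd, rounds, timeout_s):
    """Run cmd up to `rounds` times, scanning its combined output.

    Returns the 1-based round in which a stall warning first appeared,
    or None if no round showed one.
    """
    for i in range(1, rounds + 1):
        try:
            proc = subprocess.run(
                cmd,
                stdin=subprocess.DEVNULL,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                timeout=timeout_s,
            )
            out = proc.stdout
        except subprocess.TimeoutExpired as exc:
            # The guest did not exit in time; keep whatever was captured.
            out = exc.stdout or b""
        if has_rcu_stall(out.decode("utf-8", errors="replace")):
            return i
    return None


if __name__ == "__main__":
    # Command line from step 6; rounds and timeout are arbitrary choices.
    qemu = ["qemu-system-ppc64", "-kernel", "vmlinux", "-nographic",
            "-vga", "none", "-no-reboot", "-smp", "2"]
    hit = run_rounds(qemu, rounds=50, timeout_s=120)
    print("no stall seen" if hit is None else f"stall seen in round {hit}")
```

Running the same loop against a kernel with 35de589cb879 reverted and seeing no hit over many rounds would be consistent with the bisection result, though it would not by itself prove the bug is gone.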