On Wed, Apr 13, 2022 at 04:10:02PM +1000, Nicholas Piggin wrote:
> Oops, fixed subject...
> 
> Excerpts from Nicholas Piggin's message of April 13, 2022 3:11 pm:
> > +Daniel, Thomas, Viresh
> > 
> > Subject: Re: rcu_sched self-detected stall on CPU
> > 
> > Excerpts from Michael Ellerman's message of April 9, 2022 12:42 am:
> >> Michael Ellerman <mpe@xxxxxxxxxxxxxx> writes:
> >>> "Paul E. McKenney" <paulmck@xxxxxxxxxx> writes:
> >>>> On Wed, Apr 06, 2022 at 05:31:10PM +0800, Zhouyi Zhou wrote:
> >>>>> Hi
> >>>>> 
> >>>>> I can reproduce it in a ppc virtual cloud server provided by Oregon
> >>>>> State University. Following is what I do:
> >>>>> 1) curl -l https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/snapshot/linux-5.18-rc1.tar.gz
> >>>>> -o linux-5.18-rc1.tar.gz
> >>>>> 2) tar zxf linux-5.18-rc1.tar.gz
> >>>>> 3) cp config linux-5.18-rc1/.config
> >>>>> 4) cd linux-5.18-rc1
> >>>>> 5) make vmlinux -j 8
> >>>>> 6) qemu-system-ppc64 -kernel vmlinux -nographic -vga none -no-reboot
> >>>>> -smp 2 (QEMU 4.2.1)
> >>>>> 7) after 12 rounds, the bug got reproduced:
> >>>>> (http://154.223.142.244/logs/20220406/qemu.log.txt)
> >>>> 
> >>>> Just to make sure, are you both seeing the same thing? Last I knew,
> >>>> Zhouyi was chasing an RCU-tasks issue that appears only in kernels
> >>>> built with CONFIG_PROVE_RCU=y, which Miguel does not have set. Or did
> >>>> I miss something?
> >>>> 
> >>>> Miguel is instead seeing an RCU CPU stall warning where RCU's grace-period
> >>>> kthread slept for three milliseconds, but did not wake up for more than
> >>>> 20 seconds. This kthread would normally have awakened on CPU 1, but
> >>>> CPU 1 looks to me to be very unhealthy, as can be seen in your console
> >>>> output below (but maybe my idea of what is healthy for powerpc systems
> >>>> is outdated). Please see also the inline annotations.
> >>>> 
> >>>> Thoughts from the PPC guys?
> >>> 
> >>> I haven't seen it in my testing. But using Miguel's config I can
> >>> reproduce it seemingly on every boot.
> >>> 
> >>> For me it bisects to:
> >>> 
> >>> 35de589cb879 ("powerpc/time: improve decrementer clockevent processing")
> >>> 
> >>> Which seems plausible.
> >>> 
> >>> Reverting that on mainline makes the bug go away.
> >>> 
> >>> I don't see an obvious bug in the diff, but I could be wrong, or the old
> >>> code was papering over an existing bug?
> >>> 
> >>> I'll try and work out what it is about Miguel's config that exposes
> >>> this vs our defconfig, that might give us a clue.
> >> 
> >> It's CONFIG_HIGH_RES_TIMERS=n which triggers the stall.
> >> 
> >> I can reproduce just with:
> >> 
> >> $ make ppc64le_guest_defconfig
> >> $ ./scripts/config -d HIGH_RES_TIMERS
> >> 
> >> We have no defconfigs that disable HIGH_RES_TIMERS, I didn't even
> >> realise you could disable it TBH :)
> >> 
> >> The Rust CI has it disabled because I copied that from the x86 defconfig
> >> they were using back when I added the Rust support. I think that was
> >> meant to be a stripped down fast config for CI, but the result is it's
> >> just using a badly tested combination which is not helpful.
> >> 
> >> So I'll send a patch to turn HIGH_RES_TIMERS on for the Rust CI, and we
> >> can debug this further without blocking them.
> > 
> > So we traced the problem down to possibly a misunderstanding between
> > decrementer clock event device and core code.
> > 
> > The decrementer is only oneshot*ish*. It actually needs to either be
> > reprogrammed or shut down otherwise it just continues to cause
> > interrupts.
> > 
> > Before commit 35de589cb879, it was sort of two-shot. The initial
> > interrupt at the programmed time would set its internal next_tb variable
> > to ~0 and call the ->event_handler(). If that did not set_next_event or
> > stop the timer, the interrupt will fire again immediately, notice
> > next_tb is ~0, and only then stop the decrementer interrupt.
> > 
> > So that was already kind of ugly, this patch just turned it into a hang.
> > 
> > The problem happens when the tick is stopped with an event still
> > pending, then tick_nohz_handler() is called, but it bails out because
> > tick_stopped == 1 so the device never gets programmed again, and so it
> > keeps firing.
> > 
> > How to fix it? Before commit a7cba02deced, powerpc's decrementer was
> > really oneshot, but we would like to avoid doing that because it requires
> > additional programming of the hardware on each timer interrupt. We have
> > the ONESHOT_STOPPED state which seems to be just about what we want.
> > 
> > Did the ONESHOT_STOPPED patch just miss this case, or is there a reason
> > we don't stop it here? This patch seems to fix the hang (not heavily
> > tested though).

This looks plausible to me based on my interactions with ticks, but it
would be good to have someone who understands that code better than I
do to look it over.

							Thanx, Paul

> > Thanks,
> > Nick
> > 
> > ---
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 2d76c91b85de..7e13a55b6b71 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -1364,9 +1364,11 @@ static void tick_nohz_handler(struct clock_event_device *dev)
> >  	tick_sched_do_timer(ts, now);
> >  	tick_sched_handle(ts, regs);
> >  
> > -	/* No need to reprogram if we are running tickless */
> > -	if (unlikely(ts->tick_stopped))
> > +	if (unlikely(ts->tick_stopped)) {
> > +		/* If we are tickless, change the clock event to stopped */
> > +		tick_program_event(KTIME_MAX, 1);
> >  		return;
> > +	}
> >  
> >  	hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);
> >  	tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1);
> > 
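
A note for anyone who has not looked at the clockevents code recently:
as far as I can tell, tick_program_event() treats a KTIME_MAX expiry as
"stop the device" and switches it to CLOCK_EVT_STATE_ONESHOT_STOPPED,
and the next real reprogram switches it back to ONESHOT, which is why
the single extra call in the patch above should be enough. The toy
userspace C program below is only a sketch of that interaction, not
kernel code; names like fake_dev and program_event are invented for
illustration. It models a oneshot-ish device (like the decrementer)
that keeps raising its interrupt until the handler either reprograms it
or parks it.

#include <stdbool.h>
#include <stdio.h>

enum state { ONESHOT, ONESHOT_STOPPED };

struct fake_dev {
	enum state state;
	bool event_pending;	/* models an expired decrementer that keeps firing */
};

static bool tick_stopped = true;	/* the NOHZ code has already stopped the tick */

/* Models tick_program_event(); a negative expiry stands in for KTIME_MAX */
static void program_event(struct fake_dev *dev, long long expires)
{
	if (expires < 0) {
		dev->state = ONESHOT_STOPPED;	/* device parked, no more interrupts */
		dev->event_pending = false;
		return;
	}
	dev->state = ONESHOT;			/* armed for a future expiry */
	dev->event_pending = false;
}

/* Models tick_nohz_handler() with the proposed fix applied */
static void handler(struct fake_dev *dev)
{
	if (tick_stopped) {
		program_event(dev, -1);		/* the new tick_program_event(KTIME_MAX, 1) */
		return;
	}
	program_event(dev, 1000);		/* normal path: reprogram for the next tick */
}

int main(void)
{
	struct fake_dev dev = { .state = ONESHOT, .event_pending = true };
	int fired = 0;

	/* A oneshot-ish device re-raises its interrupt while the event stays pending */
	while (dev.event_pending && fired < 10) {
		fired++;
		handler(&dev);
	}
	printf("interrupts taken: %d, final state: %s\n", fired,
	       dev.state == ONESHOT_STOPPED ? "ONESHOT_STOPPED" : "ONESHOT");
	return 0;
}

Built with gcc and run, this reports one interrupt taken and a final
state of ONESHOT_STOPPED. If the program_event(dev, -1) call in
handler() is removed, which corresponds to the old behaviour of just
bailing out when tick_stopped, the event stays pending and the loop
keeps taking interrupts, which is the hang described above.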