Hi, On Thu, May 11, 2023 at 8:46 AM Petr Mladek <pmladek@xxxxxxxx> wrote: > > > @@ -111,6 +125,11 @@ static void watchdog_hardlockup_interrupt_count(void) > > > > void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs) > > { > > + if (__this_cpu_read(watchdog_hardlockup_touch)) { > > + __this_cpu_write(watchdog_hardlockup_touch, false); > > + return; > > + } > > If we clear watchdog_hardlockup_touch() here then > watchdog_hardlockup_check() won't be called yet another > watchdog_hrtimer_sample_threshold perior. > > It means that any touch will cause ignoring one full period. > The is_hardlockup() check will be done after full two periods. > > It is not ideal, see below. > > > + > > /* > > * Check for a hardlockup by making sure the CPU's timer > > * interrupt is incrementing. The timer interrupt should have > > diff --git a/kernel/watchdog_perf.c b/kernel/watchdog_perf.c > > index 9be90b2a2ea7..547917ebd5d3 100644 > > --- a/kernel/watchdog_perf.c > > +++ b/kernel/watchdog_perf.c > > @@ -112,11 +98,6 @@ static void watchdog_overflow_callback(struct perf_event *event, > > /* Ensure the watchdog never gets throttled */ > > event->hw.interrupts = 0; > > > > - if (__this_cpu_read(watchdog_nmi_touch) == true) { > > - __this_cpu_write(watchdog_nmi_touch, false); > > - return; > > - } > > The original code looks wrong. arch_touch_nmi_watchdog() caused > skipping only one period of the perf event. > > I would expect that it caused restarting the period, > something like: > > if (__this_cpu_read(watchdog_nmi_touch) == true) { > /* > * Restart the period after which the interrupt > * counter is checked. > */ > __this_cpu_write(nmi_rearmed, 0); > __this_cpu_write(last_timestamp, now); > __this_cpu_write(watchdog_nmi_touch, false); > return; > } > > By other words, we should restart the period in the very next perf > event after the watchdog was touched. > > That said, the new code looks better than the original. > IMHO, the original code was prone to false positives. I had a little bit of a hard time following, but I _think_ the "tl;dr" of all the above is that my change is fine. If I misunderstood, please yell. > Best Regards, > Petr > > PS: It might be worth fixing this problem in a separate patch at the > beginning of this patchset. It might be a candidate for stable > backports. Done. It's now its own patch and early in the series.