On Tue, Jun 20, 2017 at 02:33:09PM -0700, kan.liang@xxxxxxxxx wrote:
> From: Kan Liang <Kan.liang@xxxxxxxxx>
>
> Some users reported spurious NMI watchdog timeouts.
>
> We now have more and more systems where the Turbo range is wide enough
> that the NMI watchdog expires faster than the soft watchdog timer that
> updates the interrupt tick the NMI watchdog relies on.
>
> This problem was originally added by commit 58687acba592
> ("lockup_detector: Combine nmi_watchdog and softlockup detector").
> Previously the NMI watchdog would always check jiffies, which were
> ticking fast enough. But now the backing is quite slow so the expire
> time becomes more sensitive.
>
> For mainline the right fix is to switch the NMI watchdog to reference
> cycles, which tick always at the same rate independent of turbo mode.
> But this requires some complicated changes in perf, which are too
> difficult to backport. Since we need a stable fix too just increase the
> NMI watchdog rate here to avoid the spurious timeouts. This is not an
> ideal fix because a 3x as large Turbo range could still fail, but for
> now that's not likely.

As this is an Intel problem, we should at least restrict it to
arch/x86/kernel/apic/hw_nmi.c. I don't want to penalize other arches yet.

>
> Signed-off-by: Kan Liang <Kan.liang@xxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> Fixes: 58687acba592 ("lockup_detector: Combine nmi_watchdog and
> softlockup detector")
> ---
>
> The right fix for mainline can be found here.
> perf/x86/intel: enable CPU ref_cycles for GP counter
> perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86
> https://patchwork.kernel.org/patch/9779087/
> https://patchwork.kernel.org/patch/9779089/

Does that mean this fix is restricted to just -stable then? Otherwise I am
confused why we should take this patch, if you have a better fix above.

Cheers,
Don

>
>  kernel/watchdog_hld.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
> index 54a427d1f344..0f7c6e758b82 100644
> --- a/kernel/watchdog_hld.c
> +++ b/kernel/watchdog_hld.c
> @@ -164,7 +164,7 @@ int watchdog_nmi_enable(unsigned int cpu)
>  		firstcpu = 1;
>
>  	wd_attr = &wd_hw_attr;
> -	wd_attr->sample_period = hw_nmi_get_sample_period(watchdog_thresh);
> +	wd_attr->sample_period = 3 * hw_nmi_get_sample_period(watchdog_thresh);
>
>  	/* Try to register using hardware perf events */
>  	event = perf_event_create_kernel_counter(wd_attr, cpu, NULL, watchdog_overflow_callback, NULL);
> --
> 2.11.0
>
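
To make the timing argument in the changelog concrete, here is a rough
userspace sketch of the arithmetic, not kernel code. It rests on two
assumptions that are not stated in the patch itself: that the x86
hw_nmi_get_sample_period() sizes the counter period as
base_khz * 1000 * watchdog_thresh core cycles, and that the soft watchdog
hrtimer fires every watchdog_thresh * 2 / 5 seconds. All helper names and
the clock values in the note below are hypothetical.

```c
#include <stdint.h>

/*
 * Assumption: the perf counter overflows after
 * multiplier * base_khz * 1000 * watchdog_thresh unhalted core cycles.
 * multiplier is 1 before the patch and 3 with it.
 */
static uint64_t nmi_sample_period_cycles(uint64_t base_khz,
					 unsigned int watchdog_thresh,
					 unsigned int multiplier)
{
	return (uint64_t)multiplier * base_khz * 1000 * watchdog_thresh;
}

/*
 * Wall-clock time between NMIs in milliseconds when the core actually
 * runs at cur_khz. A clock of N kHz delivers N cycles per millisecond,
 * so cycles / kHz yields milliseconds.
 */
static uint64_t nmi_interval_ms(uint64_t period_cycles, uint64_t cur_khz)
{
	return period_cycles / cur_khz;
}

/*
 * Assumption: the soft watchdog hrtimer period is
 * watchdog_thresh * 2 / 5 seconds.
 */
static uint64_t softdog_interval_ms(unsigned int watchdog_thresh)
{
	return (uint64_t)watchdog_thresh * 2 * 1000 / 5;
}

/*
 * Returns 1 if two NMIs can arrive with no hrtimer tick in between,
 * which is the spurious-timeout case described in the changelog.
 */
static int nmi_can_fire_spuriously(uint64_t base_khz, uint64_t turbo_khz,
				   unsigned int watchdog_thresh,
				   unsigned int multiplier)
{
	uint64_t cycles = nmi_sample_period_cycles(base_khz, watchdog_thresh,
						   multiplier);

	return nmi_interval_ms(cycles, turbo_khz) <
	       softdog_interval_ms(watchdog_thresh);
}
```

Under those assumptions, a hypothetical part with a 1.2 GHz base clock
running at 3.6 GHz turbo and watchdog_thresh = 10 takes an NMI roughly
every 3.3 s while the hrtimer ticks every 4 s, so the check can trip.
With the 3x multiplier the NMI interval grows to about 10 s. The break-even
turbo/base ratio moves from 2.5 to 7.5, which matches the changelog's
remark that a 3x wider Turbo range could still fail.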