On Fri, Jun 23, 2017 at 10:01:55AM +0200, Thomas Gleixner wrote:
> On Thu, 22 Jun 2017, Don Zickus wrote:
> > On Wed, Jun 21, 2017 at 11:53:57PM +0200, Thomas Gleixner wrote:
> > > On Wed, 21 Jun 2017, kan.liang@xxxxxxxxx wrote:
> > > > We now have more and more systems where the Turbo range is wide enough
> > > > that the NMI watchdog expires faster than the soft watchdog timer that
> > > > updates the interrupt tick the NMI watchdog relies on.
> > > >
> > > > This problem was originally added by commit 58687acba592
> > > > ("lockup_detector: Combine nmi_watchdog and softlockup detector").
> > > > Previously the NMI watchdog would always check jiffies, which were
> > > > ticking fast enough. But now the backing is quite slow so the expire
> > > > time becomes more sensitive.
> > >
> > > And slapping a factor 3 on the NMI period is the wrong answer to the
> > > problem. The simple solution would be to increase the hrtimer frequency,
> > > but that's not really desired either.
> > >
> > > Find an untested patch below, which should cure the issue.
> >
> > A simple low pass filter. It compiles. :-) I don't think I have knowledge
> > to test it. Kan?
>
> Yes, and it has an interesting twist. It's only working once we have
> switched to TSC as clocksource.
>
> As long as jiffies are the clocksource, this will miserably fail because
> when the hrtimer interrupt is not delivered jiffies wont be incremented
> either and the NMI will say: Oh. not enough time elapsed. Lather, rinse and
> repeat.
>
> One simple way to fix this is with the delta patch below.

Hmm, all this work for a temp fix. Kan, how much longer until the real fix
of having perf count the right cycles?

Cheers,
Don

>
> Thanks,
>
> 	tglx
>
> 8<--------------------------
> --- a/kernel/watchdog_hld.c
> +++ b/kernel/watchdog_hld.c
> @@ -72,6 +72,7 @@ EXPORT_SYMBOL(touch_nmi_watchdog);
>
>  #ifdef CONFIG_HARDLOCKUP_CHECK_TIMESTAMP
>  static DEFINE_PER_CPU(ktime_t, last_timestamp);
> +static DEFINE_PER_CPU(unsigned int, nmi_rearmed);
>  static ktime_t watchdog_hrtimer_sample_threshold __read_mostly;
>
>  void watchdog_update_hrtimer_threshold(u64 period)
> @@ -105,8 +106,11 @@ static bool watchdog_check_timestamp(voi
>  	ktime_t delta, now = ktime_get_mono_fast_ns();
>
>  	delta = now - __this_cpu_read(last_timestamp);
> -	if (delta < watchdog_hrtimer_sample_threshold)
> -		return false;
> +	if (delta < watchdog_hrtimer_sample_threshold) {
> +		if (__this_cpu_inc_return(nmi_rearmed) < 10)
> +			return false;
> +	}
> +	__this_cpu_write(nmi_rearmed, 0);
>  	__this_cpu_write(last_timestamp, now);
>  	return true;
>  }
>
>
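
For anyone who wants to poke at the logic without booting a kernel, here is
a rough userspace sketch of what the delta patch above seems to do. It is
not kernel code: struct wd_state, wd_check_timestamp(),
wd_calls_until_check() and NMI_REARM_LIMIT are made-up names for
illustration, the per-CPU variables are replaced by a plain struct, and the
clock is just an integer the caller advances by hand. Only the comparison
against the threshold and the limit of 10 are taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name; the value 10 is the constant used in the quoted patch. */
#define NMI_REARM_LIMIT 10

/* Stand-in for the per-CPU variables last_timestamp and nmi_rearmed. */
struct wd_state {
	int64_t last_timestamp;
	unsigned int nmi_rearmed;
};

/*
 * Mirrors the patched watchdog_check_timestamp(): skip the hardlockup
 * check while too little time has elapsed, but only for a bounded number
 * of consecutive NMIs. If the clock itself is stuck (jiffies as
 * clocksource, hrtimer interrupt not delivered), the counter still forces
 * a check on the NMI_REARM_LIMIT-th NMI.
 */
static bool wd_check_timestamp(struct wd_state *s, int64_t now,
			       int64_t threshold)
{
	int64_t delta = now - s->last_timestamp;

	if (delta < threshold) {
		/* Pre-increment matches __this_cpu_inc_return() semantics. */
		if (++s->nmi_rearmed < NMI_REARM_LIMIT)
			return false;
	}
	s->nmi_rearmed = 0;
	s->last_timestamp = now;
	return true;
}

/*
 * Hypothetical helper: count how many NMIs fire before the check is
 * allowed to run, with the fake clock advancing by 'step' per NMI.
 */
static int wd_calls_until_check(int64_t step, int64_t threshold)
{
	struct wd_state s = { 0, 0 };
	int64_t now = 0;
	int calls = 0;

	do {
		now += step;
		calls++;
	} while (!wd_check_timestamp(&s, now, threshold));

	return calls;
}
```

With a threshold of 100, a clock that never advances (step 0) lets the
check through on the 10th NMI, while a clock advancing 40 per NMI lets it
through on the 3rd, once the accumulated delta reaches the threshold.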