Re: [PATCH V2] kernel/watchdog: fix spurious hard lockups

Andi Kleen <ak@xxxxxxxxxxxxxxx> · Wed, 28 Jun 2017 13:14:04 -0700

On Wed, Jun 28, 2017 at 03:00:08PM -0400, Don Zickus wrote:
> On Tue, Jun 27, 2017 at 04:48:22PM -0700, Andi Kleen wrote:
> > > I haven't heard back any test result yet.
> > > 
> > > The above patch looks good to me.
> > 
> > This needs performance testing.  It may slow down performance or latency sensitive workloads.
> 
> More motivation to work through the issues with the proposed real fix? :-)
> 
> > 
> > > Which workaround do you prefer, the above one or the one checking timestamp?
> > 
> > I prefer the earlier patch, it has far less risk of performance issues.
> 
> But now you are slowing down the nmi_watchdog so much that the
> watchdog_thresh hold becomes meaningless, no? (granted the turbo-mode blows
> it out of the water too)  So now folks who depend on the 10/5/1/whatever second
> reliability lose that.  I think that might be unfair too.

What do you mean by reliability? If you need a guaranteed reset,
you always need a separate hardware watchdog (like the TCO watchdog),
as the CPU could be hung badly enough that even the NMI watchdog is
no longer functional.

So relying solely on the NMI watchdog doesn't make any sense.

It can be a useful debugging tool for a specific class of bugs:
kernel software that loops forever.

But if that happens, does it really matter how many iterations the
loop does before it is stopped?

Even the current timeout is essentially eternity in CPU time, and 3x
eternity is still eternity.

> The hrtimer increase maintains that and just adds a few more
> interrupts/second.

Interruptions are a big deal for many people.

-Andi