Re: [PATCH] kernel/watchdog: fix spurious hard lockups

On Tue, Jun 20, 2017 at 02:33:09PM -0700, kan.liang@xxxxxxxxx wrote:
> From: Kan Liang <Kan.liang@xxxxxxxxx>
> 
> Some users reported spurious NMI watchdog timeouts.
> 
> We now have more and more systems where the Turbo range is wide enough
> that the NMI watchdog expires faster than the soft watchdog timer that
> updates the interrupt tick the NMI watchdog relies on.
> 
> This problem was originally added by commit 58687acba592
> ("lockup_detector: Combine nmi_watchdog and softlockup detector").
> Previously the NMI watchdog would always check jiffies, which were
> ticking fast enough. But now the backing timer is quite slow, so the
> expiry time becomes more sensitive.
> 
> For mainline the right fix is to switch the NMI watchdog to reference
> cycles, which tick always at the same rate independent of turbo mode.
> But this requires some complicated changes in perf, which are too
> difficult to backport. Since we need a stable fix too, just increase the
> NMI watchdog period here to avoid the spurious timeouts. This is not an
> ideal fix because a 3x as large Turbo range could still fail, but for
> now that's not likely.
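
For anyone following along, the arithmetic behind the description above can
be sketched in user-space C. This is only a rough model, not kernel code. It
assumes the x86 sample period is cpu_khz * 1000 * watchdog_thresh cycles and
that the timer updating the interrupt count fires every
2 * watchdog_thresh / 5 seconds. Neither figure is stated in this thread, and
all helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * ASSUMPTION: the NMI perf event counts CPU cycles and overflows after
 * cpu_khz * 1000 * watchdog_thresh of them (watchdog_thresh seconds at
 * the base clock).
 */
static uint64_t nmi_sample_period_cycles(uint64_t base_khz, int watchdog_thresh)
{
	return base_khz * 1000ULL * (uint64_t)watchdog_thresh;
}

/* Wall-clock milliseconds between NMIs while the core runs at run_khz. */
static uint64_t nmi_interval_ms(uint64_t period_cycles, uint64_t run_khz)
{
	assert(run_khz != 0);
	return period_cycles / run_khz;	/* kHz is cycles per millisecond */
}

/* ASSUMPTION: the timer fires every 2 * watchdog_thresh / 5 seconds. */
static uint64_t timer_interval_ms(int watchdog_thresh)
{
	return 2ULL * (uint64_t)watchdog_thresh * 1000ULL / 5ULL;
}

/*
 * A false positive becomes possible once a whole NMI period fits between
 * two timer ticks, so the NMI can see an unchanged interrupt count.
 * mult models the multiplier the patch applies to the sample period.
 */
static int spurious_possible(uint64_t base_khz, uint64_t run_khz,
			     int watchdog_thresh, unsigned int mult)
{
	uint64_t period = (uint64_t)mult *
		nmi_sample_period_cycles(base_khz, watchdog_thresh);

	return nmi_interval_ms(period, run_khz) <
		timer_interval_ms(watchdog_thresh);
}
```

With a hypothetical 2.0 GHz base clock and watchdog_thresh = 10,
spurious_possible(2000000, 6000000, 10, 1) is true in this model (NMI every
3333 ms against a 4000 ms timer), while the same call with mult = 3 is false.
Under these assumptions the multiplier only moves the break-even turbo ratio
from 2.5x to 7.5x, which matches the caveat that a 3x as large Turbo range
could still fail.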

As this is an Intel problem, we should at least restrict it to 
arch/x86/kernel/apic/hw_nmi.c.  I don't want to penalize other arches yet.

> 
> Signed-off-by: Kan Liang <Kan.liang@xxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> Fixes: 58687acba592 ("lockup_detector: Combine nmi_watchdog and softlockup detector")
> ---
> 
> The right fix for mainline can be found here.
> perf/x86/intel: enable CPU ref_cycles for GP counter
> perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86
> https://patchwork.kernel.org/patch/9779087/
> https://patchwork.kernel.org/patch/9779089/

Does that mean this fix is restricted to just -stable then?  Otherwise I am
confused why we should take this patch, if you have a better fix above.

Cheers,
Don

> 
>  kernel/watchdog_hld.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
> index 54a427d1f344..0f7c6e758b82 100644
> --- a/kernel/watchdog_hld.c
> +++ b/kernel/watchdog_hld.c
> @@ -164,7 +164,7 @@ int watchdog_nmi_enable(unsigned int cpu)
>  		firstcpu = 1;
>  
>  	wd_attr = &wd_hw_attr;
> -	wd_attr->sample_period = hw_nmi_get_sample_period(watchdog_thresh);
> +	wd_attr->sample_period = 3 * hw_nmi_get_sample_period(watchdog_thresh);
>  
>  	/* Try to register using hardware perf events */
>  	event = perf_event_create_kernel_counter(wd_attr, cpu, NULL, watchdog_overflow_callback, NULL);
> -- 
> 2.11.0
> 


