Re: [PATCH] x86/PCI: Convert force_disable_hpet() to standard quirk

Zhang Rui <rui.zhang@xxxxxxxxx> · Wed, 02 Dec 2020 15:28:18 +0800

On Mon, 2020-11-30 at 20:21 +0100, Thomas Gleixner wrote:
> Feng,
> 
> On Fri, Nov 27 2020 at 14:11, Feng Tang wrote:
> > On Fri, Nov 27, 2020 at 12:27:34AM +0100, Thomas Gleixner wrote:
> > > On Thu, Nov 26 2020 at 09:24, Feng Tang wrote:
> > > Yes, that can happen. But OTOH, we should start to think about
> > > the
> > > requirements for using the TSC watchdog.

My original proposal is to disable jiffies and refined-jiffies as the
clocksource watchdog, because they are not reliable and it's better to
use clocksource that has a hardware counter as watchdog, like the patch
below, which I didn't sent out for upstream.

>From cf9ce0ecab8851a3745edcad92e072022af3dbd9 Mon Sep 17 00:00:00 2001
From: Zhang Rui <rui.zhang@xxxxxxxxx>
Date: Fri, 19 Jun 2020 22:03:23 +0800
Subject: [RFC PATCH] time/clocksource: do not use refined-jiffies as watchdog

On IA platforms, if HPET is disabled, either via x86 early-quirks, or
via kernel commandline, refined-jiffies will be used as clocksource
watchdog in early boot phase, before acpi_pm timer registered.

This is not a problem if jiffies are accurate.
But in some cases, for example, when serial console is enabled, it may
take several milliseconds to write to the console, with irq disabled,
frequently. Thus many ticks may become longer than it should be.

Using refined-jiffies as watchdog in this case breaks the system because
a) duration calculated by refined-jiffies watchdog is always consistent
   with the watchdog timeout issued using add_timer(), say, around 500ms.
b) duration calculated by the running clocksource, usually TSC on IA
   platforms, reflects the real time cost, which may be much larger.
This results in the running clocksource being disabled erroneously.

This is reproduced on ICL because HPET is disabled in x86 early-quirks,
and also reproduced on a KBL and a WHL platform when HPET is disabled
via command line.

BTW, commit fd329f276eca
("x86/mtrr: Skip cache flushes on CPUs with cache self-snooping") is
another example that refined-jiffies causes the same problem when ticks
become slow for some other reason.

IMO, the right solution is to only use hardware clocksource as watchdog.
Then even if ticks are slow, both the running clocksource and the watchdog
returns real time cost, and they still match.

Signed-off-by: Zhang Rui <rui.zhang@xxxxxxxxx>
---
 kernel/time/clocksource.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 02441ead3c3b..e7e703858fa6 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -364,6 +364,10 @@ static void clocksource_select_watchdog(bool fallback)
 		watchdog = NULL;
 
 	list_for_each_entry(cs, &clocksource_list, list) {
+		/* Do not use refined-jiffies as clocksource watchdog */
+		if (cs->rating <= 2)
+			continue;
+
 		/* cs is a clocksource to be watched. */
 		if (cs->flags & CLOCK_SOURCE_MUST_VERIFY)
 			continue;
-- 
2.17.1

> > > 
> > > I'm inclined to lift that requirement when the CPU has:
> > > 
> > >     1) X86_FEATURE_CONSTANT_TSC
> > >     2) X86_FEATURE_NONSTOP_TSC
> > >     3) X86_FEATURE_NONSTOP_TSC_S3
> > 
> > IIUC, this feature exists for several generations of Atom
> > platforms,
> > and it is always coupled with 1) and 2), so it could be skipped for
> > the checking.
> 
> Yes, we can ignore that bit as it's not widely available and not
> required to solve the problem.
> 
> > >     4) X86_FEATURE_TSC_ADJUST
> > >     
> > >     5) At max. 4 sockets
> > > 
Should we consider some other corner cases like TSC is not used as
clocksource? refined_jiffies watchdog can break any other hardware
clocksource when it becomes inaccurate.

> > > The only reason I hate to disable HPET upfront at least during
> > > boot is
> > > that HPET is the best mechanism for the refined TSC calibration.
> > > PMTIMER
> > > sucks because it's slow and wraps around pretty quick.
> > > 
> > > So we could do the following even on platforms where HPET stops
> > > in some
> > > magic PC? state:
> > > 
> > >   - Register it during early boot as clocksource
> > > 
> > >   - Prevent the enablement as clockevent and the chardev hpet
> > > timer muck
> > > 
> > >   - Prevent the magic PC? state up to the point where the refined
> > >     TSC calibration is finished.
> > > 
> > >   - Unregister it once the TSC has taken over as system
> > > clocksource and
> > >     enable the magic PC? state in which HPET gets disfunctional.
> > 

On the other side, for ICL, the HPET problem is observed on early
samples, and I didn't reproduce the problem on new ones I have.
But I need to confirm internally if it is safe to re-enable it first.

thanks,
rui
> > This looks reasonable to me. 
> > 
> > I have thought about lowering the hpet rating to lower than
> > PMTIMER, so it
> > still contributes in early boot phase, and fades out after PMTIMER
> > is
> > initialised.
> 
> Not a good idea. pm_timer is initialized before the refined
> calibration
> finishes.
> 
> Thanks,
> 
>         tglx