Re: [PATCH RFC 1/1] x86/paravirt: introduce param to disable pv sched_clock

Dongli Zhang <dongli.zhang@xxxxxxxxxx> · Thu, 19 Oct 2023 16:11:12 -0700

Hi Vitaly, Sean and David,

On 10/19/23 08:40, Sean Christopherson wrote:
> On Thu, Oct 19, 2023, Vitaly Kuznetsov wrote:
>> Dongli Zhang <dongli.zhang@xxxxxxxxxx> writes:
>>
>>> As mentioned in the linux kernel development document, "sched_clock() is
>>> used for scheduling and timestamping". While there is a default native
>>> implementation, many paravirtualizations have their own implementations.
>>>
>>> About KVM, it uses kvm_sched_clock_read() and there is no way to only
>>> disable KVM's sched_clock. The "no-kvmclock" may disable all
>>> paravirtualized kvmclock features.
> 
> ...
> 
>>> Please suggest and comment if other options are better:
>>>
>>> 1. Global param (this RFC patch).
>>>
>>> 2. The kvmclock specific param (e.g., "no-vmw-sched-clock" in vmware).
>>>
>>> Indeed I like the 2nd method.
>>>
>>> 3. Enforce native sched_clock only when TSC is invariant (hyper-v method).
>>>
>>> 4. Remove and cleanup pv sched_clock, and always use pv_sched_clock() for
>>> all (suggested by Peter Zijlstra in [3]). Some paravirtualizations may
>>> want to keep the pv sched_clock.
>>
>> Normally, it should be up to the hypervisor to tell the guest which
>> clock to use, i.e. if TSC is reliable or not. Let me put my question
>> this way: if TSC on the particular host is good for everything, why
>> does the hypervisor advertises 'kvmclock' to its guests?
> 
> I suspect there are two reasons.
> 
>   1. As is likely the case in our fleet, no one revisited the set of advertised
>      PV features when defining the VM shapes for a new generation of hardware, or
>      whoever did the reviews wasn't aware that advertising kvmclock is actually
>      suboptimal.  All the PV clock stuff in KVM is quite labyrinthian, so it's
>      not hard to imagine it getting overlooked.
> 
>   2. Legacy VMs.  If VMs have been running with a PV clock for years, forcing
>      them to switch to a new clocksource is high-risk, low-reward.
> 
>> If for some 'historical reasons' we can't revoke features we can always
>> introduce a new PV feature bit saying that TSC is preferred.
>>
>> 1) Global param doesn't sound like a good idea to me: chances are that
>> people will be setting it on their guest images to workaround problems
>> on one hypervisor (or, rather, on one public cloud which is too lazy to
>> fix their hypervisor) while simultaneously creating problems on another.
>>
>> 2) KVM specific parameter can work, but as KVM's sched_clock is the same
>> as kvmclock, I'm not convinced it actually makes sense to separate the
>> two. Like if sched_clock is known to be bad but TSC is good, why do we
>> need to use PV clock at all? Having a parameter for debugging purposes
>> may be OK though...
>>
>> 3) This is Hyper-V specific, you can see that it uses a dedicated PV bit
>> (HV_ACCESS_TSC_INVARIANT) and not the architectural
>> CPUID.80000007H:EDX[8]. I'm not sure we can blindly trust the later on
>> all hypervisors.
>>
>> 4) Personally, I'm not sure that relying on 'TSC is crap' detection is
>> 100% reliable. I can imagine cases when we can't detect that fact that
>> while synchronized across CPUs and not going backwards, it is, for
>> example, ticking with an unstable frequency and PV sched clock is
>> supposed to give the right correction (all of them are rdtsc() based
>> anyways, aren't they?).
> 
> Yeah, practically speaking, the only thing adding a knob to turn off using PV
> clocks for sched_clock will accomplish is creating an even bigger matrix of
> combinations that can cause problems, e.g. where guests end up using kvmclock
> timekeeping but not scheduling.
> 
> The explanation above and the links below fail to capture _the_ key point:
> Linux-as-a-guest already prioritizes the TSC over paravirt clocks as the clocksource
> when the TSC is constant and nonstop (first spliced blob below).
> 
> What I suggested is that if the TSC is chosen over a PV clock as the clocksource,
> then we have the kernel also override the sched_clock selection (second spliced
> blob below).
> 
> That doesn't require the guest admin to opt-in, and doesn't create even more
> combinations to support.  It also provides for a smoother transition for when
> customers inevitably end up creating VMs on hosts that don't advertise kvmclock
> (or any PV clock).

I would prefer to always leave the option to allow the guest admin to change the
decision, especially for diagnostic/workaround reason (although the kvmclock is
always buggy when tsc is buggy).


As a summary of discussion:

1. Vitaly Kuznetsov prefers global param, e.g., for the easy deployment of the
same guest image on different hypervisors.

2. Sean Christopherson prefers an automatic change of sched_clock when
clocksource is or not TSC.


However, the clocksource and TSC are different concepts.

1. The clocksource is an arch global concept. That is, all archs (e.g., x86,
arm, mips) share the same implementation to register/select clocksource. In
additon, something like HPET does not have sched_clock.

2. Some architecture has its own sched_clock implementation. E.g., x86 has its
own sched_clock implementation in arch/x86/kernel/tsc.c.

309 notrace u64 sched_clock(void)
310 {
311         u64 now;
312         preempt_disable_notrace();
313         now = sched_clock_noinstr();
314         preempt_enable_notrace();
315         return now;
316 }

3. When !CONFIG_PARAVIRT, it is native_sched_clock().

4. When CONFIG_PARAVIRT, it is sched_clock_noinstr()->paravirt_sched_clock()
referring to paravirt specific implementation (native/kvm/xen/vmware/hyperv).

That is, the pv sched_clock is a concept under x86 when CONFIG_PARAVIRT==true.


Although the implementation is possible, I just do not like the idea to change
some arch global code, to accommodate some requirement as a leaf of the tree.


How about to keep the change at x86 as in below? It won't work unless I change
'tsc_clocksource_reliable' to an early_param.

---
 arch/x86/include/asm/paravirt.h |  2 +-
 arch/x86/kernel/kvmclock.c      | 12 +++++++-----
 arch/x86/kernel/paravirt.c      | 16 +++++++++++++++-
 3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 6c8ff12..118b793 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -24,7 +24,7 @@
 DECLARE_STATIC_CALL(pv_steal_clock, dummy_steal_clock);
 DECLARE_STATIC_CALL(pv_sched_clock, dummy_sched_clock);

-void paravirt_set_sched_clock(u64 (*func)(void));
+bool paravirt_set_sched_clock(u64 (*func)(void));

 static __always_inline u64 paravirt_sched_clock(void)
 {
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index fb8f5214..0b8bf56 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -93,13 +93,15 @@ static noinstr u64 kvm_sched_clock_read(void)

 static inline void kvm_sched_clock_init(bool stable)
 {
-	if (!stable)
-		clear_sched_clock_stable();
 	kvm_sched_clock_offset = kvm_clock_read();
-	paravirt_set_sched_clock(kvm_sched_clock_read);

-	pr_info("kvm-clock: using sched offset of %llu cycles",
-		kvm_sched_clock_offset);
+	if (!paravirt_set_sched_clock(kvm_sched_clock_read)) {
+		if (!stable)
+			clear_sched_clock_stable();
+
+		pr_info("kvm-clock: using sched offset of %llu cycles",
+			kvm_sched_clock_offset);
+	}

 	BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
 		sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 97f1436..f8ad521 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -118,9 +118,23 @@ static u64 native_steal_clock(int cpu)
 DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
 DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);

-void paravirt_set_sched_clock(u64 (*func)(void))
+bool paravirt_set_sched_clock(u64 (*func)(void))
 {
+	if (tsc_clocksource_reliable)
+		goto refuse;
+
+	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
+	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
+	    !check_tsc_unstable())
+		goto refuse;
+
 	static_call_update(pv_sched_clock, func);
+
+	return 0;
+
+refuse:
+	pr_info("sched_clock: use native when TSC is reliable");
+	return -EPERM;
 }

 /* These are in entry.S */



Indeed my favorite is to keep within kvmclock.
(This won't work until I turn 'tsc_clocksource_reliable' into early_param).

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index fb8f5214..f16655d 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -286,6 +286,7 @@ static int kvmclock_setup_percpu(unsigned int cpu)

 void __init kvmclock_init(void)
 {
+       bool prefer_tsc;
        u8 flags;

        if (!kvm_para_available() || !kvmclock)
@@ -313,19 +314,8 @@ void __init kvmclock_init(void)
        if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
                pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);

-       flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
-       kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
-
-       x86_platform.calibrate_tsc = kvm_get_tsc_khz;
-       x86_platform.calibrate_cpu = kvm_get_tsc_khz;
-       x86_platform.get_wallclock = kvm_get_wallclock;
-       x86_platform.set_wallclock = kvm_set_wallclock;
-#ifdef CONFIG_X86_LOCAL_APIC
-       x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
-#endif
-       x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
-       x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
-       kvm_get_preset_lpj();
+       if (tsc_clocksource_reliable)
+               prefer_tsc = true;

        /*
         * X86_FEATURE_NONSTOP_TSC is TSC runs at constant rate
@@ -334,10 +324,31 @@ void __init kvmclock_init(void)
         * Invariant TSC exposed by host means kvmclock is not necessary:
         * can use TSC as clocksource.
         *
+        * The TSC is used also when tsc_clocksource_reliable is configured
+        * in kernel command line on purpose.
         */
        if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
            boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
            !check_tsc_unstable())
+               prefer_tsc = true;
+
+       if (!prefer_tsc) {
+               flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
+               kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
+       }
+
+       x86_platform.calibrate_tsc = kvm_get_tsc_khz;
+       x86_platform.calibrate_cpu = kvm_get_tsc_khz;
+       x86_platform.get_wallclock = kvm_get_wallclock;
+       x86_platform.set_wallclock = kvm_set_wallclock;
+#ifdef CONFIG_X86_LOCAL_APIC
+       x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
+#endif
+       x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
+       x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
+       kvm_get_preset_lpj();
+
+       if (prefer_tsc)
                kvm_clock.rating = 299;

        clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);


Thank you very much!

Dongli Zhang

> 
>>> To introduce a param may be easier to backport to old kernel version.
>>>
>>> References:
>>> [1] https://urldefense.com/v3/__https://lore.kernel.org/all/20230926230649.67852-1-dongli.zhang@xxxxxxxxxx/__;!!ACWV5N9M2RV99hQ!Omk8Q6d8PW-UcKNdCRAeA8qSb698y3Eier2hro5vporwTCHqHSmYYk8fCinciVOHUG40CK4GQpHsjNlDiQ$ 
>>> [2] https://urldefense.com/v3/__https://lore.kernel.org/all/20231018195638.1898375-1-seanjc@xxxxxxxxxx/__;!!ACWV5N9M2RV99hQ!Omk8Q6d8PW-UcKNdCRAeA8qSb698y3Eier2hro5vporwTCHqHSmYYk8fCinciVOHUG40CK4GQpHh5avzQg$ 
>>> [3] https://urldefense.com/v3/__https://lore.kernel.org/all/20231002211651.GA3774@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/__;!!ACWV5N9M2RV99hQ!Omk8Q6d8PW-UcKNdCRAeA8qSb698y3Eier2hro5vporwTCHqHSmYYk8fCinciVOHUG40CK4GQpH74It6kQ$ 
> 
> On Mon, Oct 2, 2023 at 11:18 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>>> Do we need to update the documentation to always suggest TSC when it is
>>> constant, as I believe many users still prefer pv clock than tsc?
>>>
>>> Thanks to tsc ratio scaling, the live migration will not impact tsc.
>>>
>>> >From the source code, the rating of kvm-clock is still higher than tsc.
>>>
>>> BTW., how about to decrease the rating if guest detects constant tsc?
>>>
>>> 166 struct clocksource kvm_clock = {
>>> 167         .name   = "kvm-clock",
>>> 168         .read   = kvm_clock_get_cycles,
>>> 169         .rating = 400,
>>> 170         .mask   = CLOCKSOURCE_MASK(64),
>>> 171         .flags  = CLOCK_SOURCE_IS_CONTINUOUS,
>>> 172         .enable = kvm_cs_enable,
>>> 173 };
>>>
>>> 1196 static struct clocksource clocksource_tsc = {
>>> 1197         .name                   = "tsc",
>>> 1198         .rating                 = 300,
>>> 1199         .read                   = read_tsc,
>>
>> That's already done in kvmclock_init().
>>
>>         if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
>>             boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
>>             !check_tsc_unstable())
>>                 kvm_clock.rating = 299;
>>
>> See also: https://urldefense.com/v3/__https://lore.kernel.org/all/ZOjF2DMBgW*2FzVvL3@google.com__;JQ!!ACWV5N9M2RV99hQ!Omk8Q6d8PW-UcKNdCRAeA8qSb698y3Eier2hro5vporwTCHqHSmYYk8fCinciVOHUG40CK4GQpFjD9PZNg$ 
>>
>>> 2. The sched_clock.
>>>
>>> The scheduling is impacted if there is big drift.
>>
>> ...
>>
>>> Unfortunately, the "no-kvmclock" kernel parameter disables all pv clock
>>> operations (not only sched_clock), e.g., after line 300.
>>
>> ...
>>
>>> Should I introduce a new param to disable no-kvm-sched-clock only, or to
>>> introduce a new param to allow the selection of sched_clock?
>>
>> I don't think we want a KVM-specific knob, because every flavor of paravirt guest
>> would need to do the same thing.  And unless there's a good reason to use a
>> paravirt clock, this really shouldn't be something the guest admin needs to opt
>> into using.
> 
> 
> On Mon, Oct 2, 2023 at 2:06 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>
>> On Mon, Oct 02, 2023 at 11:18:50AM -0700, Sean Christopherson wrote:
>>> Assuming the desirable thing to do is to use native_sched_clock() in this
>>> scenario, do we need a separate rating system, or can we simply tie the
>>> sched clock selection to the clocksource selection, e.g. override the
>>> paravirt stuff if the TSC clock has higher priority and is chosen?
>>
>> Yeah, I see no point of another rating system. Just force the thing back
>> to native (or don't set it to that other thing).