On Wed, 2025-02-26 at 18:18 -0800, Sean Christopherson wrote:
> This... snowballed a bit.
>
> The bulk of the changes are in kvmclock and TSC, but pretty much every
> hypervisor's guest-side code gets touched at some point. I am reasonably
> confident in the correctness of the KVM changes. For all other hypervisors,
> assume it's completely broken until proven otherwise.
>
> Note, I deliberately omitted:
>
>   Alexey Makhalov <alexey.amakhalov@xxxxxxxxxxxx>
>   jailhouse-dev@xxxxxxxxxxxxxxxx
>
> from the To/Cc, as those emails bounced on the last version, and I have zero
> desire to get 38*2 emails telling me an email couldn't be delivered.
>
> The primary goal of this series is (or at least was, when I started) to
> fix flaws with SNP and TDX guests where a PV clock provided by the untrusted
> hypervisor is used instead of the secure/trusted TSC that is controlled by
> trusted firmware.
>
> The secondary goal is to draft off of the SNP and TDX changes to slightly
> modernize running under KVM. Currently, KVM guests will use TSC for
> clocksource, but not sched_clock. And they ignore Intel's CPUID-based TSC
> and CPU frequency enumeration, even when using the TSC instead of kvmclock.
> And if the host provides the core crystal frequency in CPUID.0x15, then KVM
> guests can use that for the APIC timer period instead of manually calibrating
> the frequency.
>
> Lots more background on the SNP/TDX motivation:
> https://lore.kernel.org/all/20250106124633.1418972-13-nikunj@xxxxxxx

Looks good; thanks for tackling this.

I think there are still some things from my older series at
https://lore.kernel.org/all/20240522001817.619072-1-dwmw2@xxxxxxxxxxxxx/
which this doesn't address. Specifically, the accuracy and consistency
of what KVM advertises to the guest as the KVM clock. And as the Xen
clock, more to the point, because guests generally *know* that the KVM
clock is awful, but expect better of the Xen clock.
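For concreteness, the guest derives the clock from its TSC with a
multiplier and a shift, roughly as below. This is only a userspace
sketch modeled on the pvclock conversion; struct clock_params,
scale_tsc_delta() and guest_clock_ns() are illustrative names, not the
kernel's, and the struct is a simplified stand-in for what the guest
reads from the kvmclock page.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the per-vCPU clock parameters. */
struct clock_params {
	uint64_t tsc_timestamp;	/* guest TSC value at the reference point */
	uint64_t system_time;	/* clock value in ns at that same point */
	uint32_t mul;		/* 32.32 fixed-point multiplier */
	int8_t shift;		/* pre-shift applied to the TSC delta */
};

/* Scale a TSC delta to nanoseconds: ((delta << shift) * mul) >> 32. */
static uint64_t scale_tsc_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	/* Shifting a 64-bit value by 64 or more is undefined in C. */
	assert(shift > -64 && shift < 64);

	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;

	/*
	 * Compute (delta * mul) >> 32 without a 128-bit type by splitting
	 * delta into 32-bit halves; the result is truncated to 64 bits.
	 */
	uint64_t hi = delta >> 32;
	uint64_t lo = delta & 0xffffffffu;

	return hi * mul + ((lo * mul) >> 32);
}

/* The clock is a pure function of the TSC reading and the parameters. */
static uint64_t guest_clock_ns(const struct clock_params *p, uint64_t tsc)
{
	return p->system_time +
	       scale_tsc_delta(tsc - p->tsc_timestamp, p->mul, p->shift);
}
```

With mul = 0x80000000 and shift = 0 (0.5 ns per tick, i.e. a 2 GHz
TSC), a reading 2000 ticks past tsc_timestamp yields system_time + 1000
ns. As long as the parameters describe the same line, the same TSC
reading always maps to the same time.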
With a sane and consistent TSC, the mul/shift factors that KVM presents
to the guest in the kvmclock structure should basically *never* change.
Not even on live update (or live migration between hosts with the same
host TSC frequency).

Take live update as the simple case: serializing the QEMU state and
restarting it immediately, just to update QEMU with the guest
experiencing only a few milliseconds of steal time. The guest TSC has a
fixed arithmetic relationship to the host TSC. That should *not* change
across the live update; not by a single count.

I don't believe the existing KVM APIs allow userspace to get that
right. That is addressed by the KVM_VCPU_TSC_SCALE ioctl in patch 7 of
that series:
https://lore.kernel.org/all/20240522001817.619072-8-dwmw2@xxxxxxxxxxxxx/

And then the KVM clock should have a fixed arithmetic relationship to
the guest TSC, which should *also* not change. Not even over live
migration: userspace should ensure the guest TSC is as accurate as
possible given NTP synchronisation between the hosts, and then the KVM
clock remains a fixed function of the guest TSC (at least, if the guest
TSC is the same frequency on source and destination).

The existing KVM API doesn't allow userspace to get *that* right
either, which is addressed by Jack's patch 3 of the series:
https://lore.kernel.org/all/20240522001817.619072-4-dwmw2@xxxxxxxxxxxxx/

The rest of the series is mostly fixing a bunch of places where KVM
gratuitously recalculates the KVM clock that it advertises to the
guest, and the fact that it does so *badly* in some cases, with a loss
of precision that causes errors in the guest.

You may already have addressed some of those; I'll go over my series
and see what still applies on top of yours.

>
> v2:
>  - Add struct to hold the TSC CPUID output. [Boris]
>  - Don't pointlessly inline the TSC CPUID helpers. [Boris]
>  - Fix a variable goof in a helper, hopefully for real this time. [Dan]
>  - Collect reviews. [Nikunj]
>  - Override the sched_clock save/restore hooks if and only if a PV clock
>    is successfully registered.
>  - During resume, restore clocksources before reading persistent time.
>  - Clean up more warts created by kvmclock.
>  - Fix more bugs in kvmclock's suspend/resume handling.
>  - Try to harden kvmclock against future bugs.
>
> v1: https://lore.kernel.org/all/20250201021718.699411-1-seanjc@xxxxxxxxxx
>
> Sean Christopherson (38):
>   x86/tsc: Add a standalone helpers for getting TSC info from CPUID.0x15
>   x86/tsc: Add standalone helper for getting CPU frequency from CPUID
>   x86/tsc: Add helper to register CPU and TSC freq calibration routines
>   x86/sev: Mark TSC as reliable when configuring Secure TSC
>   x86/sev: Move check for SNP Secure TSC support to tsc_early_init()
>   x86/tdx: Override PV calibration routines with CPUID-based calibration
>   x86/acrn: Mark TSC frequency as known when using ACRN for calibration
>   clocksource: hyper-v: Register sched_clock save/restore iff it's
>     necessary
>   clocksource: hyper-v: Drop wrappers to sched_clock save/restore
>     helpers
>   clocksource: hyper-v: Don't save/restore TSC offset when using HV
>     sched_clock
>   x86/kvmclock: Setup kvmclock for secondary CPUs iff CONFIG_SMP=y
>   x86/kvm: Don't disable kvmclock on BSP in syscore_suspend()
>   x86/paravirt: Move handling of unstable PV clocks into
>     paravirt_set_sched_clock()
>   x86/kvmclock: Move sched_clock save/restore helpers up in kvmclock.c
>   x86/xen/time: Nullify x86_platform's sched_clock save/restore hooks
>   x86/vmware: Nullify save/restore hooks when using VMware's sched_clock
>   x86/tsc: WARN if TSC sched_clock save/restore used with PV sched_clock
>   x86/paravirt: Pass sched_clock save/restore helpers during
>     registration
>   x86/kvmclock: Move kvm_sched_clock_init() down in kvmclock.c
>   x86/xen/time: Mark xen_setup_vsyscall_time_info() as __init
>   x86/pvclock: Mark setup helpers and related various as
>     __init/__ro_after_init
>   x86/pvclock: WARN if pvclock's valid_flags are overwritten
>   x86/kvmclock: Refactor handling of PVCLOCK_TSC_STABLE_BIT during
>     kvmclock_init()
>   timekeeping: Resume clocksources before reading persistent clock
>   x86/kvmclock: Hook clocksource.suspend/resume when kvmclock isn't
>     sched_clock
>   x86/kvmclock: WARN if wall clock is read while kvmclock is suspended
>   x86/kvmclock: Enable kvmclock on APs during onlining if kvmclock isn't
>     sched_clock
>   x86/paravirt: Mark __paravirt_set_sched_clock() as __init
>   x86/paravirt: Plumb a return code into __paravirt_set_sched_clock()
>   x86/paravirt: Don't use a PV sched_clock in CoCo guests with trusted
>     TSC
>   x86/tsc: Pass KNOWN_FREQ and RELIABLE as params to registration
>   x86/tsc: Rejects attempts to override TSC calibration with lesser
>     routine
>   x86/kvmclock: Mark TSC as reliable when it's constant and nonstop
>   x86/kvmclock: Get CPU base frequency from CPUID when it's available
>   x86/kvmclock: Get TSC frequency from CPUID when its available
>   x86/kvmclock: Stuff local APIC bus period when core crystal freq comes
>     from CPUID
>   x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
>   x86/paravirt: kvmclock: Setup kvmclock early iff it's sched_clock
>
>  arch/x86/coco/sev/core.c           |   9 +-
>  arch/x86/coco/tdx/tdx.c            |  27 ++-
>  arch/x86/include/asm/kvm_para.h    |  10 +-
>  arch/x86/include/asm/paravirt.h    |  16 +-
>  arch/x86/include/asm/tdx.h         |   2 +
>  arch/x86/include/asm/tsc.h         |  20 +++
>  arch/x86/include/asm/x86_init.h    |   2 -
>  arch/x86/kernel/cpu/acrn.c         |   5 +-
>  arch/x86/kernel/cpu/mshyperv.c     |  69 +------
>  arch/x86/kernel/cpu/vmware.c       |  11 +-
>  arch/x86/kernel/jailhouse.c        |   6 +-
>  arch/x86/kernel/kvm.c              |  39 +++--
>  arch/x86/kernel/kvmclock.c         | 260 +++++++++++++++++++++--------
>  arch/x86/kernel/paravirt.c         |  35 +++-
>  arch/x86/kernel/pvclock.c          |   9 +-
>  arch/x86/kernel/smpboot.c          |   2 +-
>  arch/x86/kernel/tsc.c              | 141 ++++++++++++----
>  arch/x86/kernel/x86_init.c         |   1 -
>  arch/x86/mm/mem_encrypt_amd.c      |   3 -
>  arch/x86/xen/time.c                |  13 +-
>  drivers/clocksource/hyperv_timer.c |  38 +++--
>  include/clocksource/hyperv_timer.h |   2 -
>  kernel/time/timekeeping.c          |   9 +-
>  23 files changed, 487 insertions(+), 242 deletions(-)
>
>
> base-commit: a64dcfb451e254085a7daee5fe51bf22959d52d3