On Thu, Apr 25, 2024 at 03:07:01PM -0700, Reinette Chatre wrote: > From: Isaku Yamahata <isaku.yamahata@xxxxxxxxx> > > Add KVM_CAP_X86_APIC_BUS_CYCLES_NS capability to configure the APIC > bus clock frequency for APIC timer emulation. > Allow KVM_ENABLE_CAPABILITY(KVM_CAP_X86_APIC_BUS_CYCLES_NS) to set the > frequency in nanoseconds. When using this capability, the user space > VMM should configure CPUID leaf 0x15 to advertise the frequency. > > Vishal reported that the TDX guest kernel expects a 25MHz APIC bus > frequency but ends up getting interrupts at a significantly higher rate. > > The TDX architecture hard-codes the core crystal clock frequency to > 25MHz and mandates exposing it via CPUID leaf 0x15. The TDX architecture > does not allow the VMM to override the value. > > In addition, per Intel SDM: > "The APIC timer frequency will be the processor’s bus clock or core > crystal clock frequency (when TSC/core crystal clock ratio is > enumerated in CPUID leaf 0x15) divided by the value specified in > the divide configuration register." > > The resulting 25MHz APIC bus frequency conflicts with the KVM hardcoded > APIC bus frequency of 1GHz. > > The KVM doesn't enumerate CPUID leaf 0x15 to the guest unless the user > space VMM sets it using KVM_SET_CPUID. If the CPUID leaf 0x15 is > enumerated, the guest kernel uses it as the APIC bus frequency. If not, > the guest kernel measures the frequency based on other known timers like > the ACPI timer or the legacy PIT. As reported by Vishal the TDX guest > kernel expects a 25MHz timer frequency but gets timer interrupt more > frequently due to the 1GHz frequency used by KVM. > > To ensure that the guest doesn't have a conflicting view of the APIC bus > frequency, allow the userspace to tell KVM to use the same frequency that > TDX mandates instead of the default 1Ghz. > > There are several options to address this: > 1. Make the KVM able to configure APIC bus frequency (this series). > Pro: It resembles the existing hardware. The recent Intel CPUs > adapts 25MHz. > Con: Require the VMM to emulate the APIC timer at 25MHz. > 2. Make the TDX architecture enumerate CPUID leaf 0x15 to configurable > frequency or not enumerate it. > Pro: Any APIC bus frequency is allowed. > Con: Deviates from TDX architecture. > 3. Make the TDX guest kernel use 1GHz when it's running on KVM. > Con: The kernel ignores CPUID leaf 0x15. > 4. Change CPUID leaf 0x15 under TDX to report the crystal clock frequency > as 1 GHz. > Pro: This has been the virtual APIC frequency for KVM guests for 13 > years. > Pro: This requires changing only one hard-coded constant in TDX. > Con: It doesn't work with other VMMs as TDX isn't specific to KVM. > Con: Core crystal clock frequency is also used to calculate TSC > frequency. > Con: If it is configured to value different from hardware, it will > break the correctness of INTEL-PT Mini Time Count (MTC) packets > in TDs. Reviewed-by: Yuan Yao <yuan.yao@xxxxxxxxx> > > Reported-by: Vishal Annapurve <vannapurve@xxxxxxxxxx> > Closes: https://lore.kernel.org/lkml/20231006011255.4163884-1-vannapurve@xxxxxxxxxx/ > Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx> > Reviewed-by: Rick Edgecombe <rick.p.edgecombe@xxxxxxxxx> > Co-developed-by: Reinette Chatre <reinette.chatre@xxxxxxxxx> > Signed-off-by: Reinette Chatre <reinette.chatre@xxxxxxxxx> > --- > Changes v5: > - Rename capability KVM_CAP_X86_APIC_BUS_FREQUENCY -> > KVM_CAP_X86_APIC_BUS_CYCLES_NS. (Xiaoyao Li) > - Add Rick's Reviewed-by tag. > > Changes v4: > - Rework implementation following Sean's guidance in: > https://lore.kernel.org/all/ZdjzIgS6EAeCsUue@xxxxxxxxxx/ > - Reword con #2 to acknowledge feedback. (Sean) > - Add the "con" information from Xiaoyao during earlier review of v2. > - Rework changelog to address comments related to "bus clock" vs > "core crystal clock" frequency. (Xiaoyao) > - Drop snippet about impact on TSC deadline timer emulation. (Maxim) > - Drop Maxim Levitsky's "Reviewed-by" tag due to many changes to patch > since tag received. > - Switch "Subject:" to match custom "KVM: X86:" -> "KVM: x86:" > > Changes v3: > - Added reviewed-by Maxim Levitsky. > - Minor update of the commit message. > > Changes v2: > - Add check if vcpu isn't created. > - Add check if lapic chip is in-kernel emulation. > - Fix build error for i386. > - Add document to api.rst. > - Typo in the commit message. > > Documentation/virt/kvm/api.rst | 17 +++++++++++++++++ > arch/x86/kvm/x86.c | 27 +++++++++++++++++++++++++++ > include/uapi/linux/kvm.h | 1 + > 3 files changed, 45 insertions(+) > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > index f0b76ff5030d..f014cd9b2217 100644 > --- a/Documentation/virt/kvm/api.rst > +++ b/Documentation/virt/kvm/api.rst > @@ -8063,6 +8063,23 @@ error/annotated fault. > > See KVM_EXIT_MEMORY_FAULT for more information. > > +7.35 KVM_CAP_X86_APIC_BUS_CYCLES_NS > +----------------------------------- > + > +:Architectures: x86 > +:Target: VM > +:Parameters: args[0] is the desired APIC bus clock rate, in nanoseconds > +:Returns: 0 on success, -EINVAL if args[0] contains an invalid value for the > + frequency or if any vCPUs have been created, -ENXIO if a virtual > + local APIC has not been created using KVM_CREATE_IRQCHIP. > + > +This capability sets VM's APIC bus clock frequency, used by KVM's in-kernel > +virtual APIC when emulating APIC timers. KVM's default value can be retrieved > +by KVM_CHECK_EXTENSION. > + > +Note: Userspace is responsible for correctly configuring CPUID 0x15, a.k.a. the > +core crystal clock frequency, if a non-zero CPUID 0x15 is exposed to the guest. > + > 8. Other capabilities. > ====================== > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 10e6315103f4..fa6954c9a9d2 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -4715,6 +4715,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) > case KVM_CAP_MEMORY_FAULT_INFO: > r = 1; > break; > + case KVM_CAP_X86_APIC_BUS_CYCLES_NS: > + r = APIC_BUS_CYCLE_NS_DEFAULT; > + break; > case KVM_CAP_EXIT_HYPERCALL: > r = KVM_EXIT_HYPERCALL_VALID_MASK; > break; > @@ -6755,6 +6758,30 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, > } > mutex_unlock(&kvm->lock); > break; > + case KVM_CAP_X86_APIC_BUS_CYCLES_NS: { > + u64 bus_cycle_ns = cap->args[0]; > + u64 unused; > + > + r = -EINVAL; > + /* > + * Guard against overflow in tmict_to_ns(). 128 is the highest > + * divide value that can be programmed in APIC_TDCR. > + */ > + if (!bus_cycle_ns || > + check_mul_overflow((u64)U32_MAX * 128, bus_cycle_ns, &unused)) > + break; > + > + r = 0; > + mutex_lock(&kvm->lock); > + if (!irqchip_in_kernel(kvm)) > + r = -ENXIO; > + else if (kvm->created_vcpus) > + r = -EINVAL; > + else > + kvm->arch.apic_bus_cycle_ns = bus_cycle_ns; > + mutex_unlock(&kvm->lock); > + break; > + } > default: > r = -EINVAL; > break; > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h > index 2190adbe3002..6a4d9432ab11 100644 > --- a/include/uapi/linux/kvm.h > +++ b/include/uapi/linux/kvm.h > @@ -917,6 +917,7 @@ struct kvm_enable_cap { > #define KVM_CAP_MEMORY_ATTRIBUTES 233 > #define KVM_CAP_GUEST_MEMFD 234 > #define KVM_CAP_VM_TYPES 235 > +#define KVM_CAP_X86_APIC_BUS_CYCLES_NS 236 > > struct kvm_irq_routing_irqchip { > __u32 irqchip; > -- > 2.34.1 > >