On Tue, 2021-05-25 at 15:58 +0000, Sean Christopherson wrote:
> On Tue, May 25, 2021, Stamatis, Ilias wrote:
> > On Mon, 2021-05-24 at 18:44 +0000, Sean Christopherson wrote:
> > > Yes, but its existence is a complete hack. vmx->current_tsc_ratio has the same
> > > scope as vcpu->arch.tsc_scaling_ratio, i.e. vmx == vcpu == vcpu->arch. Unlike
> > > per-VMCS tracking, it should not be useful, keyword "should".
> > >
> > > What I meant by my earlier comment:
> > >
> > >   Its use in vmx_vcpu_load_vmcs() is basically "write the VMCS if we forgot to
> > >   earlier", which is all kinds of wrong.
> > >
> > > is that vmx_vcpu_load_vmcs() should never write vmcs.TSC_MULTIPLIER. The correct
> > > behavior is to set the field at VMCS initialization, and then immediately set it
> > > whenever the ratio is changed, e.g. on nested transition, from userspace, etc...
> > > In other words, my unclear feedback was to make it obsolete (and drop it) by
> > > fixing the underlying mess, not to just drop the optimization hack.
> >
> > I understood this and replied earlier. The right place for the hw multiplier
> > field to be updated is inside set_tsc_khz() in common code when the ratio
> > changes. However, this requires adding another vendor callback etc. As all
> > this is further refactoring I believe it's better to leave this series as is -
> > ie only touching code that is directly related to nested TSC scaling and not
> > try to do everything as part of the same series.
>
> But it directly impacts your code, e.g. the nested enter/exit flows would need
> to dance around the decache silliness. And I believe it even more directly
> impacts this series: kvm_set_tsc_khz() fails to handle the case where userspace
> invokes KVM_SET_TSC_KHZ while L2 is active.
>
> > This makes testing easier too.
>
> Hmm, sort of. Yes, the fewer patches/modifications in a series definitely makes
> the series itself easier to test. But stepping back and looking at the total
> cost of testing, I would argue that punting related changes to a later time
> increases the overall cost. E.g. if someone else picks up the clean up work,
> then they have to redo most, if not all, of the testing that you are already
> doing, including getting access to the proper hardware, understanding what tests
> to prioritize, etc... Whereas adding one more patch to your series is an
> incremental cost since you already have the hardware setup, know which tests to
> run, etc...
>
> > We can still implement these changes later.
>
> We can, but we shouldn't. Simply dropping vmx->current_tsc_ratio is not an
> option; it knowingly introduces a (minor) performance regression, for no reason
> other than wanting to avoid code churn. Piling more stuff on top of the flawed
> decache logic is impolite, as it adds more work for the person that ends up
> doing the cleanup. I would 100% agree if this were a significant cleanup and/or
> completely unrelated, but IMO that's not the case.
>
> Compile tested only...
>
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 029c9615378f..34ad7a17458a 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -90,6 +90,7 @@ KVM_X86_OP_NULL(has_wbinvd_exit)
>  KVM_X86_OP(get_l2_tsc_offset)
>  KVM_X86_OP(get_l2_tsc_multiplier)
>  KVM_X86_OP(write_tsc_offset)
> +KVM_X86_OP(write_tsc_multiplier)
>  KVM_X86_OP(get_exit_info)
>  KVM_X86_OP(check_intercept)
>  KVM_X86_OP(handle_exit_irqoff)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index f099277b993d..a334ce7741ab 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1308,6 +1308,7 @@ struct kvm_x86_ops {
>  	u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
>  	u64 (*get_l2_tsc_multiplier)(struct kvm_vcpu *vcpu);
>  	void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
> +	void (*write_tsc_multiplier)(struct kvm_vcpu *vcpu, u64 multiplier);
>
>  	/*
>  	 * Retrieve somewhat arbitrary exit information. Intended to be used
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index b18f60463073..914afcceb46d 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1103,6 +1103,14 @@ static void svm_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
>  	vmcb_mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
>  }
>
> +static void svm_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 l1_multiplier)
> +{
> +	/*
> +	 * Handled when loading guest state since the ratio is programmed via
> +	 * MSR_AMD64_TSC_RATIO, not a field in the VMCB.
> +	 */
> +}
> +
>  /* Evaluate instruction intercepts that depend on guest CPUID features. */
>  static void svm_recalc_instruction_intercepts(struct kvm_vcpu *vcpu,
>  					      struct vcpu_svm *svm)
> @@ -4528,6 +4536,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
>  	.get_l2_tsc_offset = svm_get_l2_tsc_offset,
>  	.get_l2_tsc_multiplier = svm_get_l2_tsc_multiplier,
>  	.write_tsc_offset = svm_write_tsc_offset,
> +	.write_tsc_multiplier = svm_write_tsc_multiplier,
>
>  	.load_mmu_pgd = svm_load_mmu_pgd,
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 6058a65a6ede..712190493926 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -2535,7 +2535,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>  	vmcs_write64(TSC_OFFSET, vcpu->arch.tsc_offset);
>
>  	if (kvm_has_tsc_control)
> -		decache_tsc_multiplier(vmx);
> +		vmcs_write64(TSC_MULTIPLIER, vcpu->arch.tsc_scaling_ratio);
>
>  	nested_vmx_transition_tlb_flush(vcpu, vmcs12, true);
>
> @@ -4505,7 +4505,7 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
>  		vmcs_write32(TPR_THRESHOLD, vmx->nested.l1_tpr_threshold);
>
>  	if (kvm_has_tsc_control)
> -		decache_tsc_multiplier(vmx);
> +		vmcs_write64(TSC_MULTIPLIER, vcpu->arch.tsc_scaling_ratio);
>
>  	if (vmx->nested.change_vmcs01_virtual_apic_mode) {
>  		vmx->nested.change_vmcs01_virtual_apic_mode = false;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 4b70431c2edd..bf845a08995e 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1390,11 +1390,6 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
>
>  		vmx->loaded_vmcs->cpu = cpu;
>  	}
> -
> -	/* Setup TSC multiplier */
> -	if (kvm_has_tsc_control &&
> -	    vmx->current_tsc_ratio != vcpu->arch.tsc_scaling_ratio)
> -		decache_tsc_multiplier(vmx);
>  }
>
>  /*
> @@ -1813,6 +1808,11 @@ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
>  	vmcs_write64(TSC_OFFSET, offset);
> ...skipping...
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -322,8 +322,6 @@ struct vcpu_vmx {
>  	/* apic deadline value in host tsc */
>  	u64 hv_deadline_tsc;
>
> -	u64 current_tsc_ratio;
> -
>  	unsigned long host_debugctlmsr;
>
>  	/*
> @@ -532,12 +530,6 @@ static inline struct vmcs *alloc_vmcs(bool shadow)
>  			      GFP_KERNEL_ACCOUNT);
>  }
>
> -static inline void decache_tsc_multiplier(struct vcpu_vmx *vmx)
> -{
> -	vmx->current_tsc_ratio = vmx->vcpu.arch.tsc_scaling_ratio;
> -	vmcs_write64(TSC_MULTIPLIER, vmx->current_tsc_ratio);
> -}
> -
>  static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx)
>  {
>  	return vmx->secondary_exec_control &
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b61b54cea495..690de1868873 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2179,14 +2179,16 @@ static u32 adjust_tsc_khz(u32 khz, s32 ppm)
>  	return v;
>  }
>
> +static void kvm_vcpu_write_tsc_multiplier(struct kvm_vcpu *vcpu,
> +					  u64 l1_multiplier);
> +
>  static int set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
>  {
>  	u64 ratio;
>
>  	/* Guest TSC same frequency as host TSC? */
>  	if (!scale) {
> -		vcpu->arch.l1_tsc_scaling_ratio = kvm_default_tsc_scaling_ratio;
> -		vcpu->arch.tsc_scaling_ratio = kvm_default_tsc_scaling_ratio;
> +		kvm_vcpu_write_tsc_multiplier(vcpu, kvm_default_tsc_scaling_ratio);
>  		return 0;
>  	}
>
> @@ -2212,7 +2214,7 @@ static int set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
>  		return -1;
>  	}
>
> -	vcpu->arch.l1_tsc_scaling_ratio = vcpu->arch.tsc_scaling_ratio = ratio;
> +	kvm_vcpu_write_tsc_multiplier(vcpu, ratio);
>  	return 0;
>  }
>
> @@ -2224,8 +2226,7 @@ static int kvm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz)
>  	/* tsc_khz can be zero if TSC calibration fails */
>  	if (user_tsc_khz == 0) {
>  		/* set tsc_scaling_ratio to a safe value */
> -		vcpu->arch.l1_tsc_scaling_ratio = kvm_default_tsc_scaling_ratio;
> -		vcpu->arch.tsc_scaling_ratio = kvm_default_tsc_scaling_ratio;
> +		kvm_vcpu_write_tsc_multiplier(vcpu, kvm_default_tsc_scaling_ratio);
>  		return -1;
>  	}
>
> @@ -2383,6 +2384,25 @@ static void kvm_vcpu_write_tsc_offset(struct kvm_vcpu *vcpu, u64 l1_offset)
>  	static_call(kvm_x86_write_tsc_offset)(vcpu, vcpu->arch.tsc_offset);
>  }
>
> +static void kvm_vcpu_write_tsc_multiplier(struct kvm_vcpu *vcpu,
> +					  u64 l1_multiplier)
> +{
> +	if (!kvm_has_tsc_control)
> +		return;
> +
> +	vcpu->arch.l1_tsc_scaling_ratio = l1_multiplier;
> +
> +	/* Userspace is changing the multiplier while L2 is active... */
> +	if (is_guest_mode(vcpu))
> +		vcpu->arch.tsc_scaling_ratio = kvm_calc_nested_tsc_multiplier(
> +			l1_multiplier,
> +			static_call(kvm_x86_get_l2_tsc_multiplier)(vcpu));
> +	else
> +		vcpu->arch.tsc_scaling_ratio = l1_multiplier;
> +
> +	static_call(kvm_x86_write_tsc_multiplier)(vcpu, vcpu->arch.tsc_scaling_ratio);
> +}
> +
>  static inline bool kvm_check_tsc_unstable(void)
>  {
>  #ifdef CONFIG_X86_64

Hmm, this patch actually still removes the caching and introduces a small
performance overhead. For example, if neither L1 nor L2 is scaled, it will
still do a vmwrite on every L2 entry/exit.

So do we want to get rid of decache_tsc_multiplier() but keep
vmx->current_tsc_ratio and do the check inside write_tsc_multiplier()? Or
alternatively delete vmx->current_tsc_ratio too and have
write_tsc_multiplier() receive 2 parameters, one for the old multiplier and
one for the new?
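
To make the first option concrete, here is a rough userspace mock (not kernel
code, and not tested on real hardware). Everything in it is a hypothetical
stand-in: struct mock_vmx, mock_write_tsc_multiplier(), MOCK_DEFAULT_RATIO and
the write counter are assumptions for illustration only, and it models a single
multiplier field, ignoring that vmcs01 and vmcs02 are separate VMCSs. It only
shows the compare-before-write check that keeping vmx->current_tsc_ratio would
allow:

```c
#include <stdint.h>

/* Placeholder "unscaled" ratio; an assumption for this mock only. */
#define MOCK_DEFAULT_RATIO (1ULL << 48)

/* Hypothetical stand-in for the relevant bits of struct vcpu_vmx. */
struct mock_vmx {
	uint64_t current_tsc_ratio;	/* cached copy of the last value written */
	uint64_t tsc_multiplier_field;	/* stands in for vmcs.TSC_MULTIPLIER */
	unsigned long nr_vmwrites;	/* number of mock VMWRITEs issued */
};

/* Mimic "set the field at VMCS initialization". */
static void mock_vmx_init(struct mock_vmx *vmx, uint64_t ratio)
{
	vmx->current_tsc_ratio = ratio;
	vmx->tsc_multiplier_field = ratio;
	vmx->nr_vmwrites = 1;
}

/* Option 1: keep the cached value and skip the write when nothing changed. */
static void mock_write_tsc_multiplier(struct mock_vmx *vmx, uint64_t multiplier)
{
	if (vmx->current_tsc_ratio == multiplier)
		return;

	vmx->current_tsc_ratio = multiplier;
	vmx->tsc_multiplier_field = multiplier;
	vmx->nr_vmwrites++;
}

/*
 * Simulate n nested entry/exit pairs and return how many mock VMWRITEs
 * they caused. l2_ratio is the combined ratio used while L2 runs.
 */
static unsigned long mock_nested_round_trips(struct mock_vmx *vmx,
					     uint64_t l1_ratio,
					     uint64_t l2_ratio,
					     unsigned int n)
{
	unsigned long before = vmx->nr_vmwrites;
	unsigned int i;

	for (i = 0; i < n; i++) {
		mock_write_tsc_multiplier(vmx, l2_ratio);	/* L2 entry */
		mock_write_tsc_multiplier(vmx, l1_ratio);	/* L2 exit */
	}

	return vmx->nr_vmwrites - before;
}
```

With equal L1 and L2 ratios, mock_nested_round_trips() reports zero writes;
with different ratios it reports two per round trip. The second option would
look the same, except that the old value is passed in as a parameter instead
of being read from the cached field.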