On Tue, 2021-05-25 at 15:58 +0000, Sean Christopherson wrote:
> On Tue, May 25, 2021, Stamatis, Ilias wrote:
> > On Mon, 2021-05-24 at 18:44 +0000, Sean Christopherson wrote:
> > > Yes, but its existence is a complete hack. vmx->current_tsc_ratio has the same
> > > scope as vcpu->arch.tsc_scaling_ratio, i.e. vmx == vcpu == vcpu->arch. Unlike
> > > per-VMCS tracking, it should not be useful, keyword "should".
> > >
> > > What I meant by my earlier comment:
> > >
> > >   Its use in vmx_vcpu_load_vmcs() is basically "write the VMCS if we forgot to
> > >   earlier", which is all kinds of wrong.
> > >
> > > is that vmx_vcpu_load_vmcs() should never write vmcs.TSC_MULTIPLIER. The correct
> > > behavior is to set the field at VMCS initialization, and then immediately set it
> > > whenever the ratio is changed, e.g. on nested transition, from userspace, etc...
> > > In other words, my unclear feedback was to make it obsolete (and drop it) by
> > > fixing the underlying mess, not to just drop the optimization hack.
> >
> > I understood this and replied earlier. The right place for the hw multiplier
> > field to be updated is inside set_tsc_khz() in common code when the ratio
> > changes. However, this requires adding another vendor callback etc. As all
> > this is further refactoring, I believe it's better to leave this series as is,
> > i.e. only touching code that is directly related to nested TSC scaling, and not
> > try to do everything as part of the same series.
>
> But it directly impacts your code, e.g. the nested enter/exit flows would need
> to dance around the decache silliness. And I believe it even more directly
> impacts this series: kvm_set_tsc_khz() fails to handle the case where userspace
> invokes KVM_SET_TSC_KHZ while L2 is active.

Good catch!

> > This makes testing easier too.
>
> Hmm, sort of. Yes, the fewer patches/modifications in a series definitely makes
> the series itself easier to test.
> But stepping back and looking at the total
> cost of testing, I would argue that punting related changes to a later time
> increases the overall cost. E.g. if someone else picks up the clean up work,
> then they have to redo most, if not all, of the testing that you are already
> doing, including getting access to the proper hardware, understanding what tests
> to prioritize, etc... Whereas adding one more patch to your series is an
> incremental cost since you already have the hardware setup, know which tests to
> run, etc...
>
> > We can still implement these changes later.
>
> We can, but we shouldn't. Simply dropping vmx->current_tsc_ratio is not an
> option; it knowingly introduces a (minor) performance regression, for no reason
> other than wanting to avoid code churn. Piling more stuff on top of the flawed
> decache logic is impolite, as it adds more work for the person that ends up
> doing the cleanup. I would 100% agree if this were a significant cleanup and/or
> completely unrelated, but IMO that's not the case.
>
> Compile tested only...

Thank you.
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 029c9615378f..34ad7a17458a 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -90,6 +90,7 @@ KVM_X86_OP_NULL(has_wbinvd_exit)
>  KVM_X86_OP(get_l2_tsc_offset)
>  KVM_X86_OP(get_l2_tsc_multiplier)
>  KVM_X86_OP(write_tsc_offset)
> +KVM_X86_OP(write_tsc_multiplier)
>  KVM_X86_OP(get_exit_info)
>  KVM_X86_OP(check_intercept)
>  KVM_X86_OP(handle_exit_irqoff)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index f099277b993d..a334ce7741ab 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1308,6 +1308,7 @@ struct kvm_x86_ops {
>  	u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
>  	u64 (*get_l2_tsc_multiplier)(struct kvm_vcpu *vcpu);
>  	void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
> +	void (*write_tsc_multiplier)(struct kvm_vcpu *vcpu, u64 multiplier);
>
>  	/*
>  	 * Retrieve somewhat arbitrary exit information. Intended to be used
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index b18f60463073..914afcceb46d 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1103,6 +1103,14 @@ static void svm_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
>  	vmcb_mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
>  }
>
> +static void svm_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 l1_multiplier)
> +{
> +	/*
> +	 * Handled when loading guest state since the ratio is programmed via
> +	 * MSR_AMD64_TSC_RATIO, not a field in the VMCB.
> +	 */
> +}
> +

Ok, what I really wanted to avoid was having to dig into the SVM code to see where exactly it sets the TSC multiplier, or having to implement svm_write_tsc_multiplier, as I knew AMD uses an MSR instead of a VMCB field. But if we are fine with introducing this as-is above (for now), I will include it in the series, apply the other small changes suggested, and re-post the patches.