Re: [PATCH 3/8] KVM: X86: Pass an additional 'L1' argument to kvm_scale_tsc()

Maxim Levitsky <mlevitsk@xxxxxxxxxx> · Mon, 10 May 2021 16:52:27 +0300

On Thu, 2021-05-06 at 10:32 +0000, ilstam@xxxxxxxxxxx wrote:
> From: Ilias Stamatis <ilstam@xxxxxxxxxx>
> 
> Sometimes kvm_scale_tsc() needs to use the current scaling ratio and
> other times (like when reading the TSC from user space) it needs to use
> L1's scaling ratio. Have the caller specify this by passing an
> additional boolean argument.
> 
> Signed-off-by: Ilias Stamatis <ilstam@xxxxxxxxxx>
> ---
>  arch/x86/include/asm/kvm_host.h |  2 +-
>  arch/x86/kvm/x86.c              | 21 +++++++++++++--------
>  2 files changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 132e820525fb..cdddbf0b1177 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1779,7 +1779,7 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
>  void kvm_define_user_return_msr(unsigned index, u32 msr);
>  int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask);
>  
> -u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc);
> +u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc, bool l1);
>  u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc);
>  
>  unsigned long kvm_get_linear_rip(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7bc5155ac6fd..26a4c0f46f15 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2241,10 +2241,14 @@ static inline u64 __scale_tsc(u64 ratio, u64 tsc)
>  	return mul_u64_u64_shr(tsc, ratio, kvm_tsc_scaling_ratio_frac_bits);
>  }
>  
> -u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc)
> +/*
> + * If l1 is true the TSC is scaled using L1's scaling ratio, otherwise
> + * the current scaling ratio is used.
> + */
> +u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc, bool l1)
>  {
>  	u64 _tsc = tsc;
> -	u64 ratio = vcpu->arch.tsc_scaling_ratio;
> +	u64 ratio = l1 ? vcpu->arch.l1_tsc_scaling_ratio : vcpu->arch.tsc_scaling_ratio;
>  
>  	if (ratio != kvm_default_tsc_scaling_ratio)
>  		_tsc = __scale_tsc(ratio, tsc);
I do wonder if it is better to have kvm_scale_tsc_l1 and kvm_scale_tsc instead for better
readablility?

> @@ -2257,14 +2261,14 @@ static u64 kvm_compute_tsc_offset(struct kvm_vcpu *vcpu, u64 target_tsc)
>  {
>  	u64 tsc;
>  
> -	tsc = kvm_scale_tsc(vcpu, rdtsc());
> +	tsc = kvm_scale_tsc(vcpu, rdtsc(), true);
Here we have a somewhat dangerous assumption that this function 
will always be used with L1 tsc values. 

The kvm_compute_tsc_offset should at least be renamed to
kvm_compute_tsc_offset_l1 or something like that.

Currently the assumption holds though:

We call the kvm_compute_tsc_offset in:

-> kvm_synchronize_tsc which is nowadays thankfully only called
on TSC writes from qemu, which are assumed to be L1 values.

(this is pending a rework of the whole thing which I started
some time ago but haven't had a chance to finish it yet)

-> Guest write of MSR_IA32_TSC. The value written is in L1 units,
since TSC offset/scaling only covers RDTSC and RDMSR of the IA32_TSC,
while WRMSR should be intercepted by L1 and emulated.
If it is not emulated, then L2 would just write L1 value.

-> in kvm_arch_vcpu_load, when TSC is unstable, we always try to resume
the guest from the same TSC value as it had seen last time,
and then catchup.

Also host TSC values are used, and after reading this function,
I recommend to rename the vcpu->arch.last_guest_tsc
to vcpu->arch.last_guest_tsc_l1 to document this.

>  
>  	return target_tsc - tsc;
>  }
>  
>  u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
>  {
> -	return vcpu->arch.l1_tsc_offset + kvm_scale_tsc(vcpu, host_tsc);
> +	return vcpu->arch.l1_tsc_offset + kvm_scale_tsc(vcpu, host_tsc, true);
OK
>  }
>  EXPORT_SYMBOL_GPL(kvm_read_l1_tsc);
>  
> @@ -2395,9 +2399,9 @@ static inline void adjust_tsc_offset_guest(struct kvm_vcpu *vcpu,
>  
>  static inline void adjust_tsc_offset_host(struct kvm_vcpu *vcpu, s64 adjustment)
>  {
> -	if (vcpu->arch.tsc_scaling_ratio != kvm_default_tsc_scaling_ratio)
> +	if (vcpu->arch.l1_tsc_scaling_ratio != kvm_default_tsc_scaling_ratio)
>  		WARN_ON(adjustment < 0);
This should belong to patch 2 IMHO.

> -	adjustment = kvm_scale_tsc(vcpu, (u64) adjustment);
> +	adjustment = kvm_scale_tsc(vcpu, (u64) adjustment, true);
OK
>  	adjust_tsc_offset_guest(vcpu, adjustment);
>  }
>  
> @@ -2780,7 +2784,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>  	/* With all the info we got, fill in the values */
>  
>  	if (kvm_has_tsc_control)
> -		tgt_tsc_khz = kvm_scale_tsc(v, tgt_tsc_khz);
> +		tgt_tsc_khz = kvm_scale_tsc(v, tgt_tsc_khz, true);
OK (kvmclock is for L1 only, L1 hypervisor is free to expose its own kvmclock to L2)
>  
>  	if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
>  		kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
> @@ -3474,7 +3478,8 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		u64 tsc_offset = msr_info->host_initiated ? vcpu->arch.l1_tsc_offset :
>  							    vcpu->arch.tsc_offset;
>  
> -		msr_info->data = kvm_scale_tsc(vcpu, rdtsc()) + tsc_offset;
> +		msr_info->data = kvm_scale_tsc(vcpu, rdtsc(),
> +					       msr_info->host_initiated) + tsc_offset;

Since we now do two things that depend on msr_info->host_initiated, I 
think I would prefer to convert this back to regular 'if'.
I don't have a strong opinion on this though.

>  		break;
>  	}
>  	case MSR_MTRRcap:

Best regards,
	Maxim Levitsky