Hi Sean,

On 10/16/23 11:49, Sean Christopherson wrote:
> On Mon, Oct 16, 2023, David Woodhouse wrote:
>> On Mon, 2023-10-16 at 08:47 -0700, Dongli Zhang wrote:
>>> Suppose we are discussing a non-permanent solution, I would suggest:
>>>
>>> 1. Document something to accept that kvm-clock (or pvclock on KVM, including
>>> Xen on KVM) is not good enough in some cases, e.g., vCPU hotplug.
>>
>> I still don't understand the vCPU hotplug case.
>>
>> In the case where the TSC is actually sane, why would we need to reset
>> the masterclock on vCPU hotplug?
>>
>> The new vCPU gets its TSC synchronised to the others, and its kvmclock
>> parameters (mul/shift/offset based on the guest TSC) can be *precisely*
>> the same as the other vCPUs too, can't they? Why reset anything?
>
> Aha! I think I finally figured out why KVM behaves the way it does.
>
> The unnecessary masterclock updates effectively came from:
>
>   commit 7f187922ddf6b67f2999a76dcb71663097b75497
>   Author: Marcelo Tosatti <mtosatti@xxxxxxxxxx>
>   Date:   Tue Nov 4 21:30:44 2014 -0200
>
>       KVM: x86: update masterclock values on TSC writes
>
>       When the guest writes to the TSC, the masterclock TSC copy must be
>       updated as well along with the TSC_OFFSET update, otherwise a negative
>       tsc_timestamp is calculated at kvm_guest_time_update.
>
>       Once "if (!vcpus_matched && ka->use_master_clock)" is simplified to
>       "if (ka->use_master_clock)", the corresponding "if (!ka->use_master_clock)"
>       becomes redundant, so remove the do_request boolean and collapse
>       everything into a single condition.
>
> Before that, KVM only re-synced the masterclock when the masterclock was being
> enabled or disabled, i.e. KVM behaved as we want it to behave. Note, at the
> time of the above commit, VMX synchronized the TSC on *guest* writes to
> MSR_IA32_TSC:
>
> 	case MSR_IA32_TSC:
> 		kvm_write_tsc(vcpu, msr_info);
> 		break;
>
> That got changed by commit 0c899c25d754 ("KVM: x86: do not attempt TSC
> synchronization on guest writes"), but I don't think the guest angle is
> actually relevant to the fix. AFAICT, a write from the host would suffer the
> same problem. But knowing that KVM synced on guest writes is crucial to
> understanding the changelog.
>
> In kvm_write_tsc(), except for KVM's wonderful "less than 1 second" hack, KVM
> snapshotted the vCPU's current TSC *and* the current time in nanoseconds, where
> kvm->arch.cur_tsc_nsec is the current host kernel time in nanoseconds:
>
> 	ns = get_kernel_ns();
>
> 	...
>
> 	if (usdiff < USEC_PER_SEC &&
> 	    vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
> 		...
> 	} else {
> 		/*
> 		 * We split periods of matched TSC writes into generations.
> 		 * For each generation, we track the original measured
> 		 * nanosecond time, offset, and write, so if TSCs are in
> 		 * sync, we can match exact offset, and if not, we can match
> 		 * exact software computation in compute_guest_tsc()
> 		 *
> 		 * These values are tracked in kvm->arch.cur_xxx variables.
> 		 */
> 		kvm->arch.cur_tsc_generation++;
> 		kvm->arch.cur_tsc_nsec = ns;
> 		kvm->arch.cur_tsc_write = data;
> 		kvm->arch.cur_tsc_offset = offset;
> 		matched = false;
> 		pr_debug("kvm: new tsc generation %llu, clock %llu\n",
> 			 kvm->arch.cur_tsc_generation, data);
> 	}
>
> 	...
>
> 	/* Keep track of which generation this VCPU has synchronized to */
> 	vcpu->arch.this_tsc_generation = kvm->arch.cur_tsc_generation;
> 	vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
> 	vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
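Thanks for digging this out. Just to double-check that I follow the generation
bookkeeping, here is a stand-alone toy model I put together (my own
simplification, not KVM code; the field names only mirror the kvm->arch.cur_*
variables above):

/*
 * Toy model of the generation bookkeeping in kvm_write_tsc(), just to
 * confirm my reading.  Not KVM code.  Build with: gcc -o toy toy.c
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct toy_arch {
	uint64_t cur_tsc_generation;
	uint64_t cur_tsc_nsec;	/* kernel time snapshotted at the write */
	uint64_t cur_tsc_write;	/* guest TSC value that was written */
};

/*
 * A TSC write that misses the "less than 1 second" hack starts a new
 * generation and snapshots the *current* kernel time, so this_tsc_nsec
 * always postdates any masterclock snapshot taken earlier.
 */
static bool toy_write_tsc(struct toy_arch *ka, uint64_t kernel_ns,
			  uint64_t data, bool within_one_second)
{
	if (within_one_second)
		return true;	/* treated as matched, no new generation */

	ka->cur_tsc_generation++;
	ka->cur_tsc_nsec = kernel_ns;
	ka->cur_tsc_write = data;
	return false;		/* the first write never matches itself */
}

int main(void)
{
	struct toy_arch ka = { 0 };
	bool matched = toy_write_tsc(&ka, 1000, 0, false);

	printf("generation=%llu nsec=%llu matched=%d\n",
	       (unsigned long long)ka.cur_tsc_generation,
	       (unsigned long long)ka.cur_tsc_nsec, matched);
	return 0;
}

i.e., every unmatched write records the kernel time *at write time*, which is
exactly the value compute_guest_tsc() later subtracts.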
> Note that the above sets matched to false! But because kvm_track_tsc_matching()
> looks for matched+1, i.e. doesn't require the first vCPU to match itself, KVM
> would immediately compute vcpus_matched as true for VMs with a single vCPU. As
> a result, KVM would skip the masterclock update, even though a new TSC
> generation was created.
>
> 	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
> 			 atomic_read(&vcpu->kvm->online_vcpus));
>
> 	if (vcpus_matched && gtod->clock.vclock_mode == VCLOCK_TSC)
> 		if (!ka->use_master_clock)
> 			do_request = 1;
>
> 	if (!vcpus_matched && ka->use_master_clock)
> 		do_request = 1;
>
> 	if (do_request)
> 		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
>
> On hardware without TSC scaling support, vcpu->tsc_catchup is set to true if
> the guest TSC frequency is faster than the host TSC frequency, even if the TSC
> is otherwise stable. And for that mode, kvm_guest_time_update(), by way of
> compute_guest_tsc(), uses vcpu->arch.this_tsc_nsec, a.k.a. the kernel time at
> the last TSC write, to compute the guest TSC relative to kernel time:
>
> 	static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
> 	{
> 		u64 tsc = pvclock_scale_delta(kernel_ns - vcpu->arch.this_tsc_nsec,
> 					      vcpu->arch.virtual_tsc_mult,
> 					      vcpu->arch.virtual_tsc_shift);
> 		tsc += vcpu->arch.this_tsc_write;
> 		return tsc;
> 	}
>
> Except the @kernel_ns passed to compute_guest_tsc() isn't the current kernel
> time, it's the masterclock snapshot!
>
> 	spin_lock(&ka->pvclock_gtod_sync_lock);
> 	use_master_clock = ka->use_master_clock;
> 	if (use_master_clock) {
> 		host_tsc = ka->master_cycle_now;
> 		kernel_ns = ka->master_kernel_ns;
> 	}
> 	spin_unlock(&ka->pvclock_gtod_sync_lock);
>
> 	if (vcpu->tsc_catchup) {
> 		u64 tsc = compute_guest_tsc(v, kernel_ns);
>
> 		if (tsc > tsc_timestamp) {
> 			adjust_tsc_offset_guest(v, tsc - tsc_timestamp);
> 			tsc_timestamp = tsc;
> 		}
> 	}
>
> And so "kernel_ns - vcpu->arch.this_tsc_nsec" is *guaranteed* to generate a
> negative value, because this_tsc_nsec was captured after ka->master_kernel_ns.
>
> Forcing a masterclock update essentially fudged around that problem, but in a
> heavy-handed way that introduced undesirable side effects, i.e. it
> unnecessarily forces a masterclock update when a new vCPU joins the party via
> hotplug.
>
> Compile tested only, but the below should fix the vCPU hotplug case. Then
> someone (not me) just needs to figure out why kvm_xen_shared_info_init()
> forces a masterclock update.
>
> I still think we should clean up the periodic sync code, but I don't think we
> need to periodically sync the masterclock.

This looks good to me. The core idea is to not update the masterclock for the
already-synchronized cases.

How about the negative value case? I see it is still there in the Linux code.
(It is out of scope for my use case, since I do not need to run vCPUs at a TSC
frequency different from the host's.)
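To make the wrap-around concrete for myself, I wrote a minimal stand-alone
reproduction (my own simplification, not KVM code; it assumes guest and host
TSC frequencies are equal, so the pvclock_scale_delta() mul/shift step
degenerates to a plain subtraction):

/* Toy reproduction of the stale-snapshot wrap-around, not KVM code. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t toy_compute_guest_tsc(uint64_t kernel_ns,
				      uint64_t this_tsc_nsec,
				      uint64_t this_tsc_write)
{
	/* u64 arithmetic wraps: 1000 - 2000 == 2^64 - 1000. */
	return (kernel_ns - this_tsc_nsec) + this_tsc_write;
}

int main(void)
{
	uint64_t master_kernel_ns = 1000;	/* masterclock snapshot ... */
	uint64_t this_tsc_nsec = 2000;		/* ... taken before the TSC write */

	printf("guest tsc = %" PRIu64 "\n",
	       toy_compute_guest_tsc(master_kernel_ns, this_tsc_nsec, 0));
	return 0;
}

With master_kernel_ns older than this_tsc_nsec, the delta wraps to nearly
2^64, i.e. the "negative tsc_timestamp" described in the 2014 changelog.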
Thank you very much!

Dongli Zhang

> ---
>  arch/x86/kvm/x86.c | 29 ++++++++++++++++-------------
>  1 file changed, 16 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c54e1133e0d3..f0a607b6fc31 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2510,26 +2510,29 @@ static inline int gtod_is_based_on_tsc(int mode)
>  }
>  #endif
>
> -static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
> +static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
>  {
>  #ifdef CONFIG_X86_64
> -	bool vcpus_matched;
>  	struct kvm_arch *ka = &vcpu->kvm->arch;
>  	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
>
> -	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
> -			 atomic_read(&vcpu->kvm->online_vcpus));
> +	/*
> +	 * To use the masterclock, the host clocksource must be based on TSC
> +	 * and all vCPUs must have matching TSCs.  Note, the count for matching
> +	 * vCPUs doesn't include the reference vCPU, hence "+1".
> +	 */
> +	bool use_master_clock = (ka->nr_vcpus_matched_tsc + 1 ==
> +				 atomic_read(&vcpu->kvm->online_vcpus)) &&
> +				gtod_is_based_on_tsc(gtod->clock.vclock_mode);
>
>  	/*
> -	 * Once the masterclock is enabled, always perform request in
> -	 * order to update it.
> -	 *
> -	 * In order to enable masterclock, the host clocksource must be TSC
> -	 * and the vcpus need to have matched TSCs.  When that happens,
> -	 * perform request to enable masterclock.
> +	 * Request a masterclock update if the masterclock needs to be toggled
> +	 * on/off, or when starting a new generation and the masterclock is
> +	 * enabled (compute_guest_tsc() requires the masterclock snapshot to be
> +	 * taken _after_ the new generation is created).
>  	 */
> -	if (ka->use_master_clock ||
> -	    (gtod_is_based_on_tsc(gtod->clock.vclock_mode) && vcpus_matched))
> +	if ((ka->use_master_clock && new_generation) ||
> +	    (ka->use_master_clock != use_master_clock))
>  		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
>
>  	trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
> @@ -2706,7 +2709,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
>  	vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
>  	vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
>
> -	kvm_track_tsc_matching(vcpu);
> +	kvm_track_tsc_matching(vcpu, !matched);
>  }
>
>  static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
>
> base-commit: dfdc8b7884b50e3bfa635292973b530a97689f12
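By the way, to double-check when the new logic requests
KVM_REQ_MASTERCLOCK_UPDATE, I enumerated the condition in a stand-alone sketch
(my own illustration; it only replicates the boolean expression from the
patch):

/* Not part of the patch: just replicates the new request condition. */
#include <stdbool.h>
#include <stdio.h>

static bool update_requested(bool cur_on, bool next_on, bool new_gen)
{
	/*
	 * Mirrors: (ka->use_master_clock && new_generation) ||
	 *          (ka->use_master_clock != use_master_clock)
	 */
	return (cur_on && new_gen) || (cur_on != next_on);
}

int main(void)
{
	printf("cur next new_gen -> update\n");
	for (int cur = 0; cur <= 1; cur++)
		for (int next = 0; next <= 1; next++)
			for (int gen = 0; gen <= 1; gen++)
				printf("  %d    %d      %d    ->   %d\n",
				       cur, next, gen,
				       update_requested(cur, next, gen));
	return 0;
}

This confirms the hotplug case: with the masterclock already enabled, the new
vCPU's TSC matched (so no new generation), and the clocksource unchanged, no
update is requested anymore.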