Re: [PATCH v2 01/15] KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init()

David Woodhouse <dwmw2@xxxxxxxxxxxxx> · Tue, 07 May 2024 20:08:40 +0100

On Sat, 2024-05-04 at 04:42 -0300, Marcelo Tosatti wrote:
> On Sat, Apr 27, 2024 at 12:04:58PM +0100, David Woodhouse wrote:
> > 
> > In particular, KVM_REQ_MASTERCLOCK_UPDATE will take a new snapshot of
> > time as the reference in master_kernel_ns and master_cycle_now, yanking
> > the guest's clock back to match definition A at that moment.
> 
> KVM_REQ_MASTERCLOCK_UPDATE stops the vcpus because:

Took me a while to work that one out, btw. The fact that the 
KVM_REQ_MCLOCK_INPROGRESS request is asserted but never actually
*handled*, so all it does is repeatedly kick the vCPU out and make it
spin until the request is cleared is... interesting. Likewise the way
that we set KVM_REQ_MASTERCLOCK_UPDATE on *all* vCPUs, so they *all*
call kvm_update_masterclock(), when only one of them needed to. I may
clean that up a little.

>  * To avoid that problem, do not allow visibility of distinct
>  * system_timestamp/tsc_timestamp values simultaneously: use a master
>  * copy of host monotonic time values. Update that master copy
>  * in lockstep.

Right. That comment is a lot longer than the part you cited here, and
starts with 'assuming a stable TSC across pCPUS, and a stable TSC
across vCPUs'. It's the "if (ka->use_master_clock)" case.

And yes, what it's basically saying is a special case of the fact that
if you let the KVM clock run at its "natural" rate based on the guest
TSC (definition B), but each vCPU runs at that rate from a *different*
point on the line that is definition A (the host CLOCK_MONOTONIC_RAW),
bad things will happen.

I'm OK with it stopping the vCPUs (although I'd like it to do so in a
less implicitly awful way). But when we don't need to update the
reference time at all, let's not do so.

> > When invoked from in 'use_master_clock' mode, kvm_update_masterclock()
> > should probably *adjust* kvm->arch.kvmclock_offset to account for the
> > drift, instead of yanking the clock back to defintion A.
> 
> You are likely correct...
> 
> > But in the meantime there are a bunch of places where it just doesn't need to be
> > invoked at all.
> > 
> > To start with: there is no need to do such an update when a Xen guest
> > populates the shared_info page. This seems to have been a hangover from
> > the very first implementation of shared_info which automatically
> > populated the vcpu_info structures at their default locations, but even
> > then it should just have raised KVM_REQ_CLOCK_UPDATE on each vCPU
> > instead of using KVM_REQ_MASTERCLOCK_UPDATE. And now that userspace is
> > expected to explicitly set the vcpu_info even in its default locations,
> > there's not even any need for that either.
> > 
> > Fixes: 629b5348841a1 ("KVM: x86/xen: update wallclock region")
> > Signed-off-by: David Woodhouse <dwmw@xxxxxxxxxxxx>
> > Reviewed-by: Paul Durrant <paul@xxxxxxx>
> > ---
> >  arch/x86/kvm/xen.c | 2 --
> >  1 file changed, 2 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> > index f65b35a05d91..5a83a8154b79 100644
> > --- a/arch/x86/kvm/xen.c
> > +++ b/arch/x86/kvm/xen.c
> > @@ -98,8 +98,6 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
> >         wc->version = wc_version + 1;
> >         read_unlock_irq(&gpc->lock);
> >  
> > -       kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE);
> > -
> >  out:
> >         srcu_read_unlock(&kvm->srcu, idx);
> >         return ret;
> > -- 
> > 2.44.0
> 
> So KVM_REQ_MASTERCLOCK_UPDATE is to avoid the race above.
> 
> In what contexes is kvm_xen_shared_info_init called from again?
> 
> Not clear to me KVM_REQ_MASTERCLOCK_UPDATE is not needed (or that is
> needed, for that matter...).

We cal kvm_xen_shared_info_init() when the Xen "shared info" page is
set up. The only interesting part of that is the *wallclock* epoch.

The wallclock (just like KSR_KVM_WALL_CLOCK{,_NEW}) is *entirely* hosed
ever since the KVM clock stopped being based on CLOCK_MONOTONIC, since
that means that the value of "wallclock time minus KVM clock time"
actually *changes* as the KVM clock runs at a different rate to
wallclock time. 

I'm looking at a replacement for that which uses the gtod information
to give the guest a direct mapping of guest TSC to host CLOCK_TAI. And
in doing so we can *also* indicate when live migration has potentially
disrupted the guest TSC, so any further NTP/PTP refinement that the
guest may have done for itself needs to be thrown away.

Attachment:
smime.p7s

Description: S/MIME cryptographic signature