Re: x86: kvm: Revert "remove sched notifier for cross-cpu migrations"

Andy Lutomirski <luto@xxxxxxxxxxxxxx> · Thu, 26 Mar 2015 11:47:48 -0700

On Wed, Mar 25, 2015 at 4:08 AM, Radim Krčmář <rkrcmar@xxxxxxxxxx> wrote:
> 2015-03-24 15:33-0700, Andy Lutomirski:
>> On Tue, Mar 24, 2015 at 8:34 AM, Radim Krčmář <rkrcmar@xxxxxxxxxx> wrote:
>> > What is the problem?
>>
>> The kvmclock spec says that the host will increment a version field to
>> an odd number, then update stuff, then increment it to an even number.
>> The host is buggy and doesn't do this, and the result is observable
>> when one vcpu reads another vcpu's kvmclock data.
>>
>> Since there's no good way for a guest kernel to keep its vdso from
>> reading a different vcpu's kvmclock data, this is a real corner-case
>> bug.  This patch allows the vdso to retry when this happens.  I don't
>> think it's a great solution, but it should mostly work.
>
> Great explanation, thank you.
>
> Reverting the patch protects us from any migration, but I don't think we
> need to care about changing VCPUs as long as we read a consistent data
> from kvmclock.  (VCPU can change outside of this loop too, so it doesn't
> matter if we return a value not fit for this VCPU.)
>
> I think we could drop the second __getcpu if our kvmclock was being
> handled better;  maybe with a patch like the one below:
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index cc2c759f69a3..8658599e0024 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1658,12 +1658,24 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>                 &guest_hv_clock, sizeof(guest_hv_clock))))
>                 return 0;
>
> -       /*
> -        * The interface expects us to write an even number signaling that the
> -        * update is finished. Since the guest won't see the intermediate
> -        * state, we just increase by 2 at the end.
> +       /* A guest can read other VCPU's kvmclock; specification says that
> +        * version is odd if data is being modified and even after it is
> +        * consistent.
> +        * We write three times to be sure.
> +        *  1) update version to odd number
> +        *  2) write modified data (version is still odd)
> +        *  3) update version to even number
> +        *
> +        * TODO: optimize
> +        *  - only two writes should be enough -- version is first
> +        *  - the second write could update just version
>          */
> -       vcpu->hv_clock.version = guest_hv_clock.version + 2;
> +       guest_hv_clock.version += 1;
> +       kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
> +                               &guest_hv_clock,
> +                               sizeof(guest_hv_clock));
> +
> +       vcpu->hv_clock.version = guest_hv_clock.version;
>
>         /* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
>         pvclock_flags = (guest_hv_clock.flags & PVCLOCK_GUEST_STOPPED);
> @@ -1684,6 +1696,11 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>         kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
>                                 &vcpu->hv_clock,
>                                 sizeof(vcpu->hv_clock));
> +
> +       vcpu->hv_clock.version += 1;
> +       kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
> +                               &vcpu->hv_clock,
> +                               sizeof(vcpu->hv_clock));
>         return 0;
>  }
>

The trouble with this is that kvm_write_guest_cached seems to
correspond roughly to a "rep movs" variant, and those are weakly
ordered.  As a result, I don't really know whether they have
well-defined semantics wrt concurrent reads.  What we really want is
just "mov".

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html