On Wed, Mar 25, 2015 at 4:08 AM, Radim Krčmář <rkrcmar@xxxxxxxxxx> wrote: > 2015-03-24 15:33-0700, Andy Lutomirski: >> On Tue, Mar 24, 2015 at 8:34 AM, Radim Krčmář <rkrcmar@xxxxxxxxxx> wrote: >> > What is the problem? >> >> The kvmclock spec says that the host will increment a version field to >> an odd number, then update stuff, then increment it to an even number. >> The host is buggy and doesn't do this, and the result is observable >> when one vcpu reads another vcpu's kvmclock data. >> >> Since there's no good way for a guest kernel to keep its vdso from >> reading a different vcpu's kvmclock data, this is a real corner-case >> bug. This patch allows the vdso to retry when this happens. I don't >> think it's a great solution, but it should mostly work. > > Great explanation, thank you. > > Reverting the patch protects us from any migration, but I don't think we > need to care about changing VCPUs as long as we read a consistent data > from kvmclock. (VCPU can change outside of this loop too, so it doesn't > matter if we return a value not fit for this VCPU.) > > I think we could drop the second __getcpu if our kvmclock was being > handled better; maybe with a patch like the one below: > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index cc2c759f69a3..8658599e0024 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -1658,12 +1658,24 @@ static int kvm_guest_time_update(struct kvm_vcpu *v) > &guest_hv_clock, sizeof(guest_hv_clock)))) > return 0; > > - /* > - * The interface expects us to write an even number signaling that the > - * update is finished. Since the guest won't see the intermediate > - * state, we just increase by 2 at the end. > + /* A guest can read other VCPU's kvmclock; specification says that > + * version is odd if data is being modified and even after it is > + * consistent. > + * We write three times to be sure. > + * 1) update version to odd number > + * 2) write modified data (version is still odd) > + * 3) update version to even number > + * > + * TODO: optimize > + * - only two writes should be enough -- version is first > + * - the second write could update just version > */ > - vcpu->hv_clock.version = guest_hv_clock.version + 2; > + guest_hv_clock.version += 1; > + kvm_write_guest_cached(v->kvm, &vcpu->pv_time, > + &guest_hv_clock, > + sizeof(guest_hv_clock)); > + > + vcpu->hv_clock.version = guest_hv_clock.version; > > /* retain PVCLOCK_GUEST_STOPPED if set in guest copy */ > pvclock_flags = (guest_hv_clock.flags & PVCLOCK_GUEST_STOPPED); > @@ -1684,6 +1696,11 @@ static int kvm_guest_time_update(struct kvm_vcpu *v) > kvm_write_guest_cached(v->kvm, &vcpu->pv_time, > &vcpu->hv_clock, > sizeof(vcpu->hv_clock)); > + > + vcpu->hv_clock.version += 1; > + kvm_write_guest_cached(v->kvm, &vcpu->pv_time, > + &vcpu->hv_clock, > + sizeof(vcpu->hv_clock)); > return 0; > } > The trouble with this is that kvm_write_guest_cached seems to correspond roughly to a "rep movs" variant, and those are weakly ordered. As a result, I don't really know whether they have well-defined semantics wrt concurrent reads. What we really want is just "mov". --Andy -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html