On Thu, Jan 8, 2015 at 2:43 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> On Thu, Jan 8, 2015 at 2:31 PM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
>> On Tue, Jan 06, 2015 at 11:49:09AM -0800, Andy Lutomirski wrote:
>>> On Tue, Jan 6, 2015 at 10:45 AM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
>>> > On Tue, Jan 06, 2015 at 10:26:22AM -0800, Andy Lutomirski wrote:
>>> >> On Tue, Jan 6, 2015 at 10:13 AM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
>>> >> > On Tue, Jan 06, 2015 at 08:56:40AM -0800, Andy Lutomirski wrote:
>>> >> >> On Jan 6, 2015 4:01 AM, "Paolo Bonzini" <pbonzini@xxxxxxxxxx> wrote:
>>> >> >> >
>>> >> >> > On 06/01/2015 09:42, Paolo Bonzini wrote:
>>> >> >> > > > > Still confused. So we can freeze all vCPUs in the host, then update
>>> >> >> > > > > pvti 1, then resume vCPU 1, then update pvti 0? In that case, we have
>>> >> >> > > > > a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM
>>> >> >> > > > > doesn't increment the version pre-update, and we can return completely
>>> >> >> > > > > bogus results.
>>> >> >> > > > Yes.
>>> >> >> > > But then the getcpu test would fail (1->0). Even if you have an ABA
>>> >> >> > > situation (1->0->1), it's okay because the pvti that is fetched is the
>>> >> >> > > one returned by the first getcpu.
>>> >> >> >
>>> >> >> > ... this case of partial update of pvti, which is caught by the version
>>> >> >> > field, is of course different from the other (extremely unlikely) that
>>> >> >> > Andy pointed out. That is when the getcpus are done on the same vCPU,
>>> >> >> > but the rdtsc is on another.
>>> >> >> >
>>> >> >> > That one can be fixed by rdtscp, like
>>> >> >> >
>>> >> >> > do {
>>> >> >> >     // get a consistent (pvti, v, tsc) tuple
>>> >> >> >     do {
>>> >> >> >         cpu = get_cpu();
>>> >> >> >         pvti = get_pvti(cpu);
>>> >> >> >         v = pvti->version & ~1;
>>> >> >> >         // also acts as rmb();
>>> >> >> >         rdtsc_barrier();
>>> >> >> >         tsc = rdtscp(&cpu1);
>>> >> >>
>>> >> >> Off-topic note: rdtscp doesn't need a barrier at all. AIUI AMD
>>> >> >> specified it that way and both AMD and Intel implement it correctly.
>>> >> >> (rdtsc, on the other hand, definitely needs the barrier beforehand.)
>>> >> >>
>>> >> >> >         // control dependency, no need for rdtsc_barrier?
>>> >> >> >     } while(cpu != cpu1);
>>> >> >> >
>>> >> >> >     // ... compute nanoseconds from pvti and tsc ...
>>> >> >> >     rmb();
>>> >> >> > } while(v != pvti->version);
>>> >> >>
>>> >> >> Still no good. We can migrate a bunch of times so we see the same CPU
>>> >> >> all three times and *still* don't get a consistent read, unless we
>>> >> >> play nasty games with lots of version checks (I have a patch for that,
>>> >> >> but I don't like it very much). The patch is here:
>>> >> >>
>>> >> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=a69754dc5ff33f5187162b5338854ad23dd7be8d
>>> >> >>
>>> >> >> but I don't like it.
>>> >> >>
>>> >> >> Thus far, I've been told unambiguously that a guest can't observe pvti
>>> >> >> while it's being written, and I think you're now telling me that this
>>> >> >> isn't true and that a guest *can* observe pvti while it's being
>>> >> >> written while the low bit of the version field is not set. If so,
>>> >> >> this is rather strongly incompatible with the spec in the KVM docs.
>>> >> >>
>>> >> >> I don't suppose that you and Marcelo could agree on what the actual
>>> >> >> semantics that KVM provides are and could write it down in a way that
>>> >> >> people who haven't spent a long time staring at the request code
>>> >> >> understand? And maybe you could even fix the implementation while
>>> >> >> you're at it if the implementation is, indeed, broken. I have ugly
>>> >> >> patches to fix it here:
>>> >> >>
>>> >> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=3b718a050cba52563d831febc2e1ca184c02bac0
>>> >> >>
>>> >> >> but I'm not thrilled with them.
>>> >> >>
>>> >> >> --Andy
>>> >> >
>>> >> > I suppose that separating the version write from the rest of the pvclock
>>> >> > structure is sufficient, as that would guarantee the writes are not
>>> >> > reordered even with fast string REP MOVS.
>>> >> >
>>> >> > Thanks for catching this Andy!
>>> >> >
>>> >>
>>> >> Don't you still need:
>>> >>
>>> >> version++;
>>> >> write the rest;
>>> >> version++;
>>> >>
>>> >> with possible smp_wmb() in there to keep the compiler from messing around?
>>> >
>>> > Correct. Could just as well follow the protocol and use odd/even, which
>>> > is what your patch does.
>>> >
>>> > What is the point with the new flags bit though?
>>>
>>> To try to work around the problem on old hosts. I'm not at all
>>> convinced that this is worthwhile or that it helps, though.
>>
>> Andy,
>>
>> Are you going to submit the fix or should I?
>>
>
> I'd prefer if you did it. I'm not familiar enough with the KVM memory
> management stuff to do it confidently. Feel free to mooch from my
> patch if it's helpful.

Any update here? I can try it myself if no one else wants to do it.

--Andy

>
> --Andy
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC

--
Andy Lutomirski
AMA Capital Management, LLC
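
For readers following the thread, the odd/even protocol discussed above
("version++; write the rest; version++", with the reader rechecking the
version) can be sketched roughly as below. This is only an illustration in
portable C11 atomics, not the KVM or vDSO code: struct pvti_sketch, its
fields, and both function names are hypothetical, the C11 fences merely
stand in for smp_wmb() and rmb(), and a single writer is assumed.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a pvti record; NOT the real KVM layout. */
struct pvti_sketch {
    _Atomic uint32_t version;       /* odd while an update is in flight */
    _Atomic uint64_t tsc_timestamp; /* illustrative payload field */
    _Atomic uint64_t system_time;   /* illustrative payload field */
};

/* Writer: version++; write the rest; version++ (single writer assumed). */
static inline void pvti_sketch_update(struct pvti_sketch *p,
                                      uint64_t tsc, uint64_t ns)
{
    uint32_t v = atomic_load_explicit(&p->version, memory_order_relaxed);

    /* Mark the record as "being written" before touching the payload. */
    atomic_store_explicit(&p->version, v + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release); /* stands in for smp_wmb() */

    atomic_store_explicit(&p->tsc_timestamp, tsc, memory_order_relaxed);
    atomic_store_explicit(&p->system_time, ns, memory_order_relaxed);

    /* Publish: the payload stores are ordered before the even version. */
    atomic_store_explicit(&p->version, v + 2, memory_order_release);
}

/* Reader: one attempt; returns false if the snapshot may be torn. */
static inline bool pvti_sketch_try_read(struct pvti_sketch *p,
                                        uint64_t *tsc, uint64_t *ns)
{
    uint32_t v0 = atomic_load_explicit(&p->version, memory_order_acquire);

    *tsc = atomic_load_explicit(&p->tsc_timestamp, memory_order_relaxed);
    *ns = atomic_load_explicit(&p->system_time, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire); /* stands in for rmb() */

    uint32_t v1 = atomic_load_explicit(&p->version, memory_order_relaxed);

    /* Reject if an update was in progress (odd) or completed meanwhile. */
    return !(v0 & 1) && v0 == v1;
}
```

A reader would normally call pvti_sketch_try_read() in a loop until it
returns true. Note that this covers only the torn-read problem for a single
record; it does not by itself address the case raised in the thread where
the reader migrates between vCPUs and mixes one vCPU's pvti with another's
TSC.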