On Tue, Jan 06, 2015 at 11:18:21PM -0800, Andy Lutomirski wrote: > On Tue, Jan 6, 2015 at 9:38 PM, Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote: > > > > > > On 06/01/2015 17:56, Andy Lutomirski wrote: > >> Still no good. We can migrate a bunch of times so we see the same CPU > >> all three times > > > > There are no three times. The CPU you see here: > > > >>> > >>> > >>> // ... compute nanoseconds from pvti and tsc ... > >>> rmb(); > >>> } while(v != pvti->version); > > > > is the same you read here: > > > >>> cpu = get_cpu(); > > > > The algorithm is: > > I still don't see why this is safe, and I think that the issue is that > you left out part of the loop. > > > > > 1) get a consistent (cpu, version, tsc) > > > > 1.a) get cpu > > Suppose we observe cpu 0. > > > 1.b) get pvti[cpu]->version, ignoring low bit > > Missing step, presumably here: read pvti[cpu]->tsc_timestamp, scale, > etc. This could all execute on vCPU 1. We could read values that are > inconsistent with each other. > > > 1.c) get (tsc, cpu) > > Now we could end up back on vCPU 0. > > > 1.d) if cpu from 1.a and 1.c do not match, loop > > 1.e) if pvti[cpu] was being updated, we'll loop later > > > > 2) compute nanoseconds from pvti[cpu] and tsc > > > > 3) if pvti[cpu] changed under our feet during (2), i.e. version doesn't > > match, retry. > > > > As long as the CPU is consistent between get_cpu() and rdtscp(), there > > is no problem with migration, because pvti is always accessed for that > > one CPU. If there were any problem, it would be caught by the version > > check. Writing it down with two nested do...whiles makes it clearer IMHO. > > Why exactly would it be caught by the version check? > > My ugly patch tries to make the argument that, at any point at which > we observe ourselves to be on a given vCPU, that vCPU won't be > updating pvti. That means that, if version doesn't change between two > consecutive observations that we're on that vCPU, then we're okay. > This IMO sucks. It's fragile, it's hard to make a coherent argument > about correctness, and it requires at least two getcpu-like operations > to read the time. Those operations are *slow*. One is much better > than two, and zero is much better than one. > > > > >> and *still* don't get a consistent read, unless we > >> play nasty games with lots of version checks (I have a patch for that, > >> but I don't like it very much). The patch is here: > >> > >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=a69754dc5ff33f5187162b5338854ad23dd7be8d > >> > >> but I don't like it. > >> > >> Thus far, I've been told unambiguously that a guest can't observe pvti > >> while it's being written, and I think you're now telling me that this > >> isn't true and that a guest *can* observe pvti while it's being > >> written while the low bit of the version field is not set. If so, > >> this is rather strongly incompatible with the spec in the KVM docs. > > > > Where am I saying that? > > I thought the conclusion from what you and Marcelo pointed out about > the code was that, once the first vCPU updated its pvti, it could > start running guest code while the other vCPUs are still updating > pvti, so its guest code can observe the other vCPUs mid-update. > > >> Also, if you do this, can you also make setting and clearing > >> STABLE_BIT properly atomic across all vCPUs? Or at least do something > >> like setting it last and clearing it first on vPCU 0? > > > > That would be nice if you want to make the pvclock area fit in a single > > page. However, it would have to be a separate flag bit, or a separate > > CPUID feature. > > It would be nice to have. Although I think that fixing the host to > increment version pre-update and post-update may actually be good > enough. Is there any case in which it would fail in practice if we > made that fix and always looked at pvti 0? TSC_STABLE_BIT -> ~TSC_STABLE_BIT transition steps would finish but still allow VCPU-1 to use stale values from VCPU-0. To fix, do one of the following: 1) Check validity of local TSC_STABLE_BIT in addition (slow). 2) Perform update of VCPU-0 pvclock area before allowing any other VCPU to VM-entry. > > --Andy > > > > > Paolo > > > > -- > Andy Lutomirski > AMA Capital Management, LLC -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html