Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

Marcelo Tosatti <mtosatti@xxxxxxxxxx> · Wed, 7 Jan 2015 12:45:03 -0200

On Tue, Jan 06, 2015 at 11:18:21PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 6, 2015 at 9:38 PM, Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
> >
> >
> > On 06/01/2015 17:56, Andy Lutomirski wrote:
> >> Still no good.  We can migrate a bunch of times so we see the same CPU
> >> all three times
> >
> > There are no three times.  The CPU you see here:
> >
> >>>
> >>>
> >>>     // ... compute nanoseconds from pvti and tsc ...
> >>>     rmb();
> >>> }   while(v != pvti->version);
> >
> > is the same you read here:
> >
> >>>         cpu = get_cpu();
> >
> > The algorithm is:
> 
> I still don't see why this is safe, and I think that the issue is that
> you left out part of the loop.
> 
> >
> > 1) get a consistent (cpu, version, tsc)
> >
> >    1.a) get cpu
> 
> Suppose we observe cpu 0.
> 
> >    1.b) get pvti[cpu]->version, ignoring low bit
> 
> Missing step, presumably here: read pvti[cpu]->tsc_timestamp, scale,
> etc.  This could all execute on vCPU 1.  We could read values that are
> inconsistent with each other.
> 
> >    1.c) get (tsc, cpu)
> 
> Now we could end up back on vCPU 0.
> 
> >    1.d) if cpu from 1.a and 1.c do not match, loop
> >    1.e) if pvti[cpu] was being updated, we'll loop later
> >
> > 2) compute nanoseconds from pvti[cpu] and tsc
> >
> > 3) if pvti[cpu] changed under our feet during (2), i.e. version doesn't
> > match, retry.
> >
> > As long as the CPU is consistent between get_cpu() and rdtscp(), there
> > is no problem with migration, because pvti is always accessed for that
> > one CPU.  If there were any problem, it would be caught by the version
> > check.  Writing it down with two nested do...whiles makes it clearer IMHO.
> 
> Why exactly would it be caught by the version check?
> 
> My ugly patch tries to make the argument that, at any point at which
> we observe ourselves to be on a given vCPU, that vCPU won't be
> updating pvti.  That means that, if version doesn't change between two
> consecutive observations that we're on that vCPU, then we're okay.
> This IMO sucks.  It's fragile, it's hard to make a coherent argument
> about correctness, and it requires at least two getcpu-like operations
> to read the time.  Those operations are *slow*.  One is much better
> than two, and zero is much better than one.
> 
> >
> >> and *still* don't get a consistent read, unless we
> >> play nasty games with lots of version checks (I have a patch for that,
> >> but I don't like it very much).  The patch is here:
> >>
> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=a69754dc5ff33f5187162b5338854ad23dd7be8d
> >>
> >> but I don't like it.
> >>
> >> Thus far, I've been told unambiguously that a guest can't observe pvti
> >> while it's being written, and I think you're now telling me that this
> >> isn't true and that a guest *can* observe pvti while it's being
> >> written while the low bit of the version field is not set.  If so,
> >> this is rather strongly incompatible with the spec in the KVM docs.
> >
> > Where am I saying that?
> 
> I thought the conclusion from what you and Marcelo pointed out about
> the code was that, once the first vCPU updated its pvti, it could
> start running guest code while the other vCPUs are still updating
> pvti, so its guest code can observe the other vCPUs mid-update.
> 
> >> Also, if you do this, can you also make setting and clearing
> >> STABLE_BIT properly atomic across all vCPUs?  Or at least do something
> >> like setting it last and clearing it first on vPCU 0?
> >
> > That would be nice if you want to make the pvclock area fit in a single
> > page.  However, it would have to be a separate flag bit, or a separate
> > CPUID feature.
> 
> It would be nice to have.  Although I think that fixing the host to
> increment version pre-update and post-update may actually be good
> enough.  Is there any case in which it would fail in practice if we
> made that fix and always looked at pvti 0?

TSC_STABLE_BIT -> ~TSC_STABLE_BIT transition steps would finish but 
still allow VCPU-1 to use stale values from VCPU-0.

To fix, do one of the following:

1) Check validity of local TSC_STABLE_BIT in addition (slow).
2) Perform update of VCPU-0 pvclock area before allowing
any other VCPU to VM-entry.

> 
> --Andy
> 
> >
> > Paolo
> 
> 
> 
> -- 
> Andy Lutomirski
> AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html