Re: x86: kvm: Revert "remove sched notifier for cross-cpu migrations"

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Mar 26, 2015 at 4:22 PM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
> On Thu, Mar 26, 2015 at 04:09:53PM -0700, Andy Lutomirski wrote:
>> On Thu, Mar 26, 2015 at 3:56 PM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
>> > On Thu, Mar 26, 2015 at 01:58:25PM -0700, Andy Lutomirski wrote:
>> >> On Thu, Mar 26, 2015 at 1:31 PM, Radim Krcmar <rkrcmar@xxxxxxxxxx> wrote:
>> >> > 2015-03-26 11:51-0700, Andy Lutomirski:
>> >> >> On Thu, Mar 26, 2015 at 4:29 AM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
>> >> >> > On Wed, Mar 25, 2015 at 04:22:03PM -0700, Andy Lutomirski wrote:
>> >> >> >> Suppose we start out with all vcpus agreeing on their pvti and perfect
>> >> >> >> invariant TSCs.  Now the host updates its frequency (due to NTP or
>> >> >> >> whatever).  KVM updates vcpu 0's pvti.  Before KVM updates vcpu 1's
>> >> >> >> pvti, guest code on vcpus 0 and 1 see synced TSCs but different pvti.
>> >> >> >> They'll disagree on the time, and one of them will be ahead until vcpu
>> >> >> >> 1's pvti gets updated.
>> >> >> >
>> >> >> > The masterclock scheme enforces the same system_timestamp/tsc_timestamp pairs
>> >> >> > to be visible at one time, for all vcpus.
>> >> >> >
>> >> >> >
>> >> >> >  * That is, when timespec0 != timespec1, M < N. Unfortunately that is
>> >> >> >  * not
>> >> >> >  * always the case (the difference between two distinct xtime instances
>> >> >> >  * might be smaller then the difference between corresponding TSC reads,
>> >> >> >  * when updating guest vcpus pvclock areas).
>> >> >> >  *
>> >> >> >  * To avoid that problem, do not allow visibility of distinct
>> >> >> >  * system_timestamp/tsc_timestamp values simultaneously: use a master
>> >> >> >  * copy of host monotonic time values. Update that master copy
>> >> >> >  * in lockstep.
>> >> >>
>> >> >> Yuck.  So we have per cpu timing data, but the protocol is only usable
>> >> >> for monotonic timing because we forcibly freeze all vcpus when we
>> >> >> update the nominally per cpu data.
>> >> >>
>> >> >> The obvious guest implementations are still unnecessarily slow,
>> >> >> though.  It would be nice if the guest could get away without using
>> >> >> any getcpu operation at all.
>> >> >>
>> >> >> Even if we fixed the host to increment version as advertised, I think
>> >> >> we can't avoid two getcpu ops.  We need one before rdtsc to figure out
>> >> >> which pvti to look at,
>> >> >
>> >> > Yes.
>> >> >
>> >> >>                        and we need another to make sure that we were
>> >> >> actually on that cpu at the time we did rdtsc.  (Rdtscp doesn't help
>> >> >> -- we need to check version before rdtsc, and we don't know what
>> >> >> version to check until we do a getcpu.).
>> >> >
>> >> > Exactly, reading cpuid after rdtsc doesn't do that though, we could have
>> >> > migrated back between those reads.
>> >> > rtdscp would allow us to check that we read tsc of pvti's cpu.
>> >> > (It doesn't get rid of that first read.)
>> >> >
>> >> >>                                          The migration hook has the
>> >> >> same issue -- we need to check the migration count, then confirm we're
>> >> >> on that cpu, then check the migration count again, and we can't do
>> >> >> that until we know what cpu we're on.
>> >> >
>> >> > True;  the revert has a bug -- we need to check cpuid for the second
>> >> > time before rdtsc.  (Migration hook is there just because we don't know
>> >> > which cpu executed rdtsc.)
>> >>
>> >> One way or another, I'm planning on completely rewriting the vdso
>> >> code.  An early draft is here:
>> >>
>> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso&id=57ace6e6e032afc4faf7b9ec52f78a8e6642c980
>> >>
>> >> but I can't finish it until the KVM side shakes out.
>> >>
>> >> I think there are at least two ways that would work:
>> >>
>> >> a) If KVM incremented version as advertised:
>> >
>> > All for it.
>> >
>> >> cpu = getcpu();
>> >> pvti = pvti for cpu;
>> >>
>> >> ver1 = pvti->version;
>> >> check stable bit;
>> >> rdtsc_barrier, rdtsc, read scale, shift, etc.
>> >> if (getcpu() != cpu) retry;
>> >> if (pvti->version != ver1) retry;
>> >>
>> >> I think this is safe because, we're guaranteed that there was an
>> >> interval (between the two version reads) in which the vcpu we think
>> >> we're on was running and the kvmclock data was valid and marked
>> >> stable, and we know that the tsc we read came from that interval.
>> >>
>> >> Note: rdtscp isn't needed. If we're stable, is makes no difference
>> >> which cpu's tsc we actually read.
>> >
>> > Yes, can't see a problem with that.
>> >
>> >> b) If version remains buggy but we use this migrations_from hack:
>> >
>> > There is no reason for version to remain buggy.
>> >
>> >> cpu = getcpu();
>> >> pvti = pvti for cpu;
>> >> m1 = pvti->migrations_from;
>> >> barrier();
>> >>
>> >> ver1 = pvti->version;
>> >> check stable bit;
>> >> rdtsc_barrier, rdtsc, read scale, shift, etc.
>> >> if (getcpu() != cpu) retry;
>> >> if (pvti->version != ver1) retry;  /* probably not really needed */
>> >>
>> >> barrier();
>> >> if (pvti->migrations_from != m1) retry;
>> >>
>> >> This is just like (a), except that we're using a guest kernel hack to
>> >> ensure that no one migrated off the vcpu during the version-protected
>> >> critical section and that we were, in fact, on that vcpu at some point
>> >> during that critical section.  Once we've ensured that we were on
>> >> pvti's associated vcpu for the entire time we were reading it, then we
>> >> are protected by the existing versioning in the host.
>> >>
>> >> >
>> >> >> If, on the other hand, we could rely on having all of these things in
>> >> >> sync, then this complication goes away, and we go down from two getcpu
>> >> >> ops to zero.
>> >> >
>> >> > (Yeah, we should look what are the drawbacks of doing it differently.)
>> >>
>> >> If the versioning were fixed, I think we could almost get away with:
>> >>
>> >> pvti = pvti for vcpu 0;
>> >>
>> >> ver1 = pvti->version;
>> >> check stable bit;
>> >> rdtsc_barrier, rdtsc, read scale, shift, etc.
>> >> if (pvti->version != ver1) retry;
>> >>
>> >> This guarantees that the tsc came from an interval in which vcpu0's
>> >> kvmclock was *marked* stable.  If vcpu0's kvmclock were genuinely
>> >> stable in that interval, then we'd be fine, but there's a race window
>> >> in which the kvmclock is *not* stable and vcpu 0 wasn't running.
>> >
>> > What is that window again ? Have no objections against using vcpu0's
>> > pvti (cacheline should be read-only 99.9% of time).
>>
>> This is based on my (mis?)understanding of the code.  Here goes.
>>
>> Suppose we transition from stable to unstable.  The host freezes all
>> vcpus and set a flag for each vcpu that the kvmclock data needs
>> updating.  There could then be a window in which vcpu 1 runs vdso code
>> and vcpu 0 hasn't updated its kvmclock data yet.
>>
>> I don't know whether this is actually possible.  Rik suspects it isn't.
>
> I don't see why its not possible. We can force vcpu0's kvmclock to be
> updated.
>
> Do you have an estimation of the performance gain of using vcpu0 pvti?

rdtscp is approximately 3 cycles worse than rdtsc_barrier(); rdtsc()
on my machine, but that doesn't help much, since the current scheme
requires that we get the cpu number before reading the time.  The
fastest way I know of to get the cpu number is 25-ish cycles, so this
ends up being a large fraction of the overall cost.

>
>> >> Why doesn't KVM just update all of the kvmclock data at once?
>> >
>> > Because it has not been necessary -- updating kvmclock data on vcpu
>> > entry was the previous method, so that was reused.
>> >
>> >> (For
>> >> that matter, why is the pvti in guest memory at all?  Wouldn't this
>> >> all be simpler if the kvmclock data were host-allocated so the host
>> >> could write it directly and maybe even share it between guests?)
>> >
>> > And use a 4K TLB entry for that kvmclock area rather than
>> > sharing one of kernel's 2MB (or 1GB) TLB entry?
>>
>> Exactly.  I'd also move it into the vvar area instead of using the
>> fixmap so 32-bit userspace could use it.
>
> Thats an obvious downside (an additional entry in the TLB occupied just
> for the kvmclock area?).

True.  But TLB misses are actually pretty cheap on new hardware, and
it's still only one TLB entry per process.

>
>> I'm more than happy to handle the vdso side of all of this, but I'd
>> like the host code to settle down first.
>> I'm also not quite sure whether it's okay to cause the vdso timing
>> code to regress on old hosts with new gusts.
>
> Must be backwards compatible.
>

Could we add a new feature bit that indicates that the host is
updated?  Then we could use the new code on new hosts and the old code
on old hosts and eventually deprecate accelerated vdso timing on old
hosts.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Kernel]     [Kernel Development Newbies]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite Hiking]     [Linux Kernel]     [Linux SCSI]