On Thu, Mar 26, 2015 at 04:28:37PM -0700, Andy Lutomirski wrote:
> On Thu, Mar 26, 2015 at 4:22 PM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
> > On Thu, Mar 26, 2015 at 04:09:53PM -0700, Andy Lutomirski wrote:
> >> On Thu, Mar 26, 2015 at 3:56 PM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
> >> > On Thu, Mar 26, 2015 at 01:58:25PM -0700, Andy Lutomirski wrote:
> >> >> On Thu, Mar 26, 2015 at 1:31 PM, Radim Krcmar <rkrcmar@xxxxxxxxxx> wrote:
> >> >> > 2015-03-26 11:51-0700, Andy Lutomirski:
> >> >> >> On Thu, Mar 26, 2015 at 4:29 AM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
> >> >> >> > On Wed, Mar 25, 2015 at 04:22:03PM -0700, Andy Lutomirski wrote:
> >> >> >> >> Suppose we start out with all vcpus agreeing on their pvti and perfect
> >> >> >> >> invariant TSCs.  Now the host updates its frequency (due to NTP or
> >> >> >> >> whatever).  KVM updates vcpu 0's pvti.  Before KVM updates vcpu 1's
> >> >> >> >> pvti, guest code on vcpus 0 and 1 see synced TSCs but different pvti.
> >> >> >> >> They'll disagree on the time, and one of them will be ahead until vcpu
> >> >> >> >> 1's pvti gets updated.
> >> >> >> >
> >> >> >> > The masterclock scheme enforces the same system_timestamp/tsc_timestamp
> >> >> >> > pairs to be visible at one time, for all vcpus.
> >> >> >> >
> >> >> >> >  * That is, when timespec0 != timespec1, M < N. Unfortunately that is
> >> >> >> >  * not always the case (the difference between two distinct xtime
> >> >> >> >  * instances might be smaller then the difference between corresponding
> >> >> >> >  * TSC reads, when updating guest vcpus pvclock areas).
> >> >> >> >  *
> >> >> >> >  * To avoid that problem, do not allow visibility of distinct
> >> >> >> >  * system_timestamp/tsc_timestamp values simultaneously: use a master
> >> >> >> >  * copy of host monotonic time values. Update that master copy
> >> >> >> >  * in lockstep.
> >> >> >>
> >> >> >> Yuck.
> >> >> >> So we have per cpu timing data, but the protocol is only usable
> >> >> >> for monotonic timing because we forcibly freeze all vcpus when we
> >> >> >> update the nominally per cpu data.
> >> >> >>
> >> >> >> The obvious guest implementations are still unnecessarily slow,
> >> >> >> though.  It would be nice if the guest could get away without using
> >> >> >> any getcpu operation at all.
> >> >> >>
> >> >> >> Even if we fixed the host to increment version as advertised, I think
> >> >> >> we can't avoid two getcpu ops.  We need one before rdtsc to figure out
> >> >> >> which pvti to look at,
> >> >> >
> >> >> > Yes.
> >> >> >
> >> >> >> and we need another to make sure that we were
> >> >> >> actually on that cpu at the time we did rdtsc.  (Rdtscp doesn't help
> >> >> >> -- we need to check version before rdtsc, and we don't know what
> >> >> >> version to check until we do a getcpu.)
> >> >> >
> >> >> > Exactly, reading cpuid after rdtsc doesn't do that though, we could have
> >> >> > migrated back between those reads.
> >> >> > rdtscp would allow us to check that we read the tsc of pvti's cpu.
> >> >> > (It doesn't get rid of that first read.)
> >> >> >
> >> >> >> The migration hook has the
> >> >> >> same issue -- we need to check the migration count, then confirm we're
> >> >> >> on that cpu, then check the migration count again, and we can't do
> >> >> >> that until we know what cpu we're on.
> >> >> >
> >> >> > True; the revert has a bug -- we need to check cpuid for the second
> >> >> > time before rdtsc.  (The migration hook is there just because we don't
> >> >> > know which cpu executed rdtsc.)
> >> >>
> >> >> One way or another, I'm planning on completely rewriting the vdso
> >> >> code.  An early draft is here:
> >> >>
> >> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso&id=57ace6e6e032afc4faf7b9ec52f78a8e6642c980
> >> >>
> >> >> but I can't finish it until the KVM side shakes out.
> >> >>
> >> >> I think there are at least two ways that would work:
> >> >>
> >> >> a) If KVM incremented version as advertised:
> >> >
> >> > All for it.
> >> >
> >> >> cpu = getcpu();
> >> >> pvti = pvti for cpu;
> >> >>
> >> >> ver1 = pvti->version;
> >> >> check stable bit;
> >> >> rdtsc_barrier, rdtsc, read scale, shift, etc.
> >> >> if (getcpu() != cpu) retry;
> >> >> if (pvti->version != ver1) retry;
> >> >>
> >> >> I think this is safe because we're guaranteed that there was an
> >> >> interval (between the two version reads) in which the vcpu we think
> >> >> we're on was running and the kvmclock data was valid and marked
> >> >> stable, and we know that the tsc we read came from that interval.
> >> >>
> >> >> Note: rdtscp isn't needed.  If we're stable, it makes no difference
> >> >> which cpu's tsc we actually read.
> >> >
> >> > Yes, can't see a problem with that.
> >> >
> >> >> b) If version remains buggy but we use this migrations_from hack:
> >> >
> >> > There is no reason for version to remain buggy.
> >> >
> >> >> cpu = getcpu();
> >> >> pvti = pvti for cpu;
> >> >> m1 = pvti->migrations_from;
> >> >> barrier();
> >> >>
> >> >> ver1 = pvti->version;
> >> >> check stable bit;
> >> >> rdtsc_barrier, rdtsc, read scale, shift, etc.
> >> >> if (getcpu() != cpu) retry;
> >> >> if (pvti->version != ver1) retry;  /* probably not really needed */
> >> >>
> >> >> barrier();
> >> >> if (pvti->migrations_from != m1) retry;
> >> >>
> >> >> This is just like (a), except that we're using a guest kernel hack to
> >> >> ensure that no one migrated off the vcpu during the version-protected
> >> >> critical section and that we were, in fact, on that vcpu at some point
> >> >> during that critical section.  Once we've ensured that we were on
> >> >> pvti's associated vcpu for the entire time we were reading it, then we
> >> >> are protected by the existing versioning in the host.
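[For concreteness, scheme (a) could be modeled in plain userspace C roughly as below. This is a hedged, single-threaded sketch, not the real vdso code: the struct layout, the stubbed getcpu/rdtsc, and the helper names are illustrative assumptions, not the kernel's pvclock definitions.]

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical, simplified per-vcpu pvti (a userspace model; field names
 * echo pvclock_vcpu_time_info but this is not the kernel definition). */
struct pvti {
    uint32_t version;           /* odd while the host is updating it */
    uint64_t tsc_timestamp;
    uint64_t system_time;       /* nanoseconds at tsc_timestamp */
    uint32_t tsc_to_system_mul;
    int8_t   tsc_shift;
    uint8_t  flags;             /* bit 0: stable bit */
};

/* Stubs standing in for getcpu() and rdtsc(); a real vdso would use
 * something like LSL/RDTSCP and rdtsc_barrier(); rdtsc(). */
static int fake_cpu;
static uint64_t fake_tsc;
static int getcpu_stub(void) { return fake_cpu; }
static uint64_t rdtsc_stub(void) { return fake_tsc; }

/* kvmclock-style scaling: shift the tsc delta, then take the high half
 * of a 32.32 fixed-point multiply. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift >= 0)
        delta <<= shift;
    else
        delta >>= -shift;
    return (uint64_t)(((__uint128_t)delta * mul) >> 32);
}

/* Scheme (a): pick the pvti for the current cpu, snapshot its version,
 * read the tsc and compute the time, then verify we are still on that
 * cpu and the version did not change.  Returns false if the stable bit
 * is clear (the caller would fall back to a slower path). */
static bool try_read_time(const struct pvti *percpu, uint64_t *ns)
{
    for (;;) {
        int cpu = getcpu_stub();
        const struct pvti *p = &percpu[cpu];

        uint32_t ver1 = p->version;
        if (ver1 & 1)
            continue;                    /* update in progress, retry */
        if (!(p->flags & 1))
            return false;                /* not stable */

        /* rdtsc_barrier, rdtsc, read scale, shift, etc. */
        uint64_t t = p->system_time +
                     scale_delta(rdtsc_stub() - p->tsc_timestamp,
                                 p->tsc_to_system_mul, p->tsc_shift);

        if (getcpu_stub() != cpu)
            continue;                    /* migrated mid-read, retry */
        if (p->version == ver1) {
            *ns = t;                     /* tsc provably from a stable interval */
            return true;
        }
    }
}
```

The final version recheck is what ties the tsc read to the pvti snapshot: if it passes, the whole read happened inside one stable, unchanged interval on that vcpu.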
> >> >>
> >> >> >
> >> >> >> If, on the other hand, we could rely on having all of these things in
> >> >> >> sync, then this complication goes away, and we go down from two getcpu
> >> >> >> ops to zero.
> >> >> >
> >> >> > (Yeah, we should look at what the drawbacks of doing it differently are.)
> >> >>
> >> >> If the versioning were fixed, I think we could almost get away with:
> >> >>
> >> >> pvti = pvti for vcpu 0;
> >> >>
> >> >> ver1 = pvti->version;
> >> >> check stable bit;
> >> >> rdtsc_barrier, rdtsc, read scale, shift, etc.
> >> >> if (pvti->version != ver1) retry;
> >> >>
> >> >> This guarantees that the tsc came from an interval in which vcpu0's
> >> >> kvmclock was *marked* stable.  If vcpu0's kvmclock were genuinely
> >> >> stable in that interval, then we'd be fine, but there's a race window
> >> >> in which the kvmclock is *not* stable and vcpu 0 wasn't running.
> >> >
> >> > What is that window again?  I have no objections against using vcpu0's
> >> > pvti (the cacheline should be read-only 99.9% of the time).
> >>
> >> This is based on my (mis?)understanding of the code.  Here goes.
> >>
> >> Suppose we transition from stable to unstable.  The host freezes all
> >> vcpus and sets a flag for each vcpu that the kvmclock data needs
> >> updating.  There could then be a window in which vcpu 1 runs vdso code
> >> and vcpu 0 hasn't updated its kvmclock data yet.
> >>
> >> I don't know whether this is actually possible.  Rik suspects it isn't.
> >
> > I don't see why it's not possible.  We can force vcpu0's kvmclock to be
> > updated.
> >
> > Do you have an estimate of the performance gain of using vcpu0's pvti?
>
> rdtscp is approximately 3 cycles worse than rdtsc_barrier(); rdtsc()
> on my machine, but that doesn't help much, since the current scheme
> requires that we get the cpu number before reading the time.  The
> fastest way I know of to get the cpu number is 25-ish cycles, so this
> ends up being a large fraction of the overall cost.
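[The zero-getcpu variant quoted above could likewise be sketched as follows. Again a hypothetical userspace model with a stubbed rdtsc and made-up names, not the actual pvclock code; the deliberate absence of any getcpu call is both the performance win and exactly why the marked-stable-but-not-running race discussed above matters.]

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical, simplified pvti for vcpu 0 (userspace model only). */
struct pvti0 {
    uint32_t version;           /* odd while the host is updating it */
    uint64_t tsc_timestamp;
    uint64_t system_time;       /* nanoseconds at tsc_timestamp */
    uint32_t tsc_to_system_mul;
    int8_t   tsc_shift;
    uint8_t  flags;             /* bit 0: stable bit */
};

static uint64_t fake_tsc;       /* stands in for rdtsc */
static uint64_t rdtsc_stub(void) { return fake_tsc; }

/* Same kvmclock-style scaling as before: shift, then 32.32 multiply. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift >= 0)
        delta <<= shift;
    else
        delta >>= -shift;
    return (uint64_t)(((__uint128_t)delta * mul) >> 32);
}

/* "If the versioning were fixed": read vcpu0's pvti with no getcpu at
 * all, protected only by the version seqcount and the stable bit.
 * Returns false if the stable bit is clear. */
static bool read_time_vcpu0(const struct pvti0 *p, uint64_t *ns)
{
    for (;;) {
        uint32_t ver1 = p->version;
        if (ver1 & 1)
            continue;                    /* update in flight, retry */
        if (!(p->flags & 1))
            return false;                /* not stable */

        /* rdtsc_barrier() would precede the real rdtsc here */
        uint64_t t = p->system_time +
                     scale_delta(rdtsc_stub() - p->tsc_timestamp,
                                 p->tsc_to_system_mul, p->tsc_shift);

        if (p->version == ver1) {
            *ns = t;
            return true;
        }
    }
}
```

Note what the version recheck does and does not prove: the tsc came from an interval in which vcpu0's data was *marked* stable, which is only sufficient if the host guarantees the mark is never stale while other vcpus run.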
It should be possible to force vcpu0's kvmclock area to be updated before
vm-entry, on the update of any other vcpuN kvmclock area:

	kvm_make_mclock_inprogress_request(kvm);
	/* no guest entries from this point */

	pvclock_update_vm_gtod_copy(kvm);

	<---- HERE ------->

	kvm_for_each_vcpu(i, vcpu, kvm)
		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

	/* guest entries allowed */
	kvm_for_each_vcpu(i, vcpu, kvm)
		clear_bit(KVM_REQ_MCLOCK_INPROGRESS, &vcpu->requests);

> >> >> Why doesn't KVM just update all of the kvmclock data at once?
> >> >
> >> > Because it has not been necessary -- updating kvmclock data on vcpu
> >> > entry was the previous method, so that was reused.
> >> >
> >> >> (For
> >> >> that matter, why is the pvti in guest memory at all?  Wouldn't this
> >> >> all be simpler if the kvmclock data were host-allocated so the host
> >> >> could write it directly and maybe even share it between guests?)
> >> >
> >> > And use a 4K TLB entry for that kvmclock area rather than
> >> > sharing one of the kernel's 2MB (or 1GB) TLB entries?
> >>
> >> Exactly.  I'd also move it into the vvar area instead of using the
> >> fixmap so 32-bit userspace could use it.
> >
> > That's an obvious downside (an additional TLB entry occupied just
> > for the kvmclock area).
>
> True.  But TLB misses are actually pretty cheap on new hardware, and
> it's still only one TLB entry per process.
>
> >> I'm more than happy to handle the vdso side of all of this, but I'd
> >> like the host code to settle down first.
> >>
> >> I'm also not quite sure whether it's okay to cause the vdso timing
> >> code to regress on old hosts with new guests.
> >
> > Must be backwards compatible.
> >
>
> Could we add a new feature bit that indicates that the host is
> updated?  Then we could use the new code on new hosts and the old code
> on old hosts and eventually deprecate accelerated vdso timing on old
> hosts.

Makes sense.