On Mon, Jan 05, 2015 at 02:38:46PM -0800, Andy Lutomirski wrote: > On Mon, Jan 5, 2015 at 11:17 AM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote: > > On Mon, Jan 05, 2015 at 10:56:07AM -0800, Andy Lutomirski wrote: > >> On Mon, Jan 5, 2015 at 7:25 AM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote: > >> > On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote: > >> >> The pvclock vdso code was too abstracted to understand easily and > >> >> excessively paranoid. Simplify it for a huge speedup. > >> >> > >> >> This opens the door for additional simplifications, as the vdso no > >> >> longer accesses the pvti for any vcpu other than vcpu 0. > >> >> > >> >> Before, vclock_gettime using kvm-clock took about 64ns on my machine. > >> >> With this change, it takes 19ns, which is almost as fast as the pure TSC > >> >> implementation. > >> >> > >> >> Signed-off-by: Andy Lutomirski <luto@xxxxxxxxxxxxxx> > >> >> --- > >> >> arch/x86/vdso/vclock_gettime.c | 82 ++++++++++++++++++++++++------------------ > >> >> 1 file changed, 47 insertions(+), 35 deletions(-) > >> >> > >> >> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c > >> >> index 9793322751e0..f2e0396d5629 100644 > >> >> --- a/arch/x86/vdso/vclock_gettime.c > >> >> +++ b/arch/x86/vdso/vclock_gettime.c > >> >> @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu) > >> >> > >> >> static notrace cycle_t vread_pvclock(int *mode) > >> >> { > >> >> - const struct pvclock_vsyscall_time_info *pvti; > >> >> + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti; > >> >> cycle_t ret; > >> >> - u64 last; > >> >> - u32 version; > >> >> - u8 flags; > >> >> - unsigned cpu, cpu1; > >> >> - > >> >> + u64 tsc, pvti_tsc; > >> >> + u64 last, delta, pvti_system_time; > >> >> + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift; > >> >> > >> >> /* > >> >> - * Note: hypervisor must guarantee that: > >> >> - * 1. cpu ID number maps 1:1 to per-CPU pvclock time info. > >> >> - * 2. that per-CPU pvclock time info is updated if the > >> >> - * underlying CPU changes. > >> >> - * 3. that version is increased whenever underlying CPU > >> >> - * changes. > >> >> + * Note: The kernel and hypervisor must guarantee that cpu ID > >> >> + * number maps 1:1 to per-CPU pvclock time info. > >> >> + * > >> >> + * Because the hypervisor is entirely unaware of guest userspace > >> >> + * preemption, it cannot guarantee that per-CPU pvclock time > >> >> + * info is updated if the underlying CPU changes or that that > >> >> + * version is increased whenever underlying CPU changes. > >> >> + * > >> >> + * On KVM, we are guaranteed that pvti updates for any vCPU are > >> >> + * atomic as seen by *all* vCPUs. This is an even stronger > >> >> + * guarantee than we get with a normal seqlock. > >> >> * > >> >> + * On Xen, we don't appear to have that guarantee, but Xen still > >> >> + * supplies a valid seqlock using the version field. > >> >> + > >> >> + * We only do pvclock vdso timing at all if > >> >> + * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to > >> >> + * mean that all vCPUs have matching pvti and that the TSC is > >> >> + * synced, so we can just look at vCPU 0's pvti. > >> >> */ > >> > > >> > Can Xen guarantee that ? > >> > >> I think so, vacuously. Xen doesn't seem to set PVCLOCK_TSC_STABLE_BIT > >> at all. I have no idea going forward, though. > >> > >> Xen people? > >> > >> > > >> >> - do { > >> >> - cpu = __getcpu() & VGETCPU_CPU_MASK; > >> >> - /* TODO: We can put vcpu id into higher bits of pvti.version. > >> >> - * This will save a couple of cycles by getting rid of > >> >> - * __getcpu() calls (Gleb). > >> >> - */ > >> >> - > >> >> - pvti = get_pvti(cpu); > >> >> - > >> >> - version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags); > >> >> - > >> >> - /* > >> >> - * Test we're still on the cpu as well as the version. > >> >> - * We could have been migrated just after the first > >> >> - * vgetcpu but before fetching the version, so we > >> >> - * wouldn't notice a version change. > >> >> - */ > >> >> - cpu1 = __getcpu() & VGETCPU_CPU_MASK; > >> >> - } while (unlikely(cpu != cpu1 || > >> >> - (pvti->pvti.version & 1) || > >> >> - pvti->pvti.version != version)); > >> >> - > >> >> - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT))) > >> >> + > >> >> + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) { > >> >> *mode = VCLOCK_NONE; > >> >> + return 0; > >> >> + } > >> > > >> > This check must be performed after reading a stable pvti. > >> > > >> > >> We can even read it in the middle, guarded by the version checks. > >> I'll do that for v2. > >> > >> >> + > >> >> + do { > >> >> + version = pvti->version; > >> >> + > >> >> + /* This is also a read barrier, so we'll read version first. */ > >> >> + rdtsc_barrier(); > >> >> + tsc = __native_read_tsc(); > >> >> + > >> >> + pvti_tsc_to_system_mul = pvti->tsc_to_system_mul; > >> >> + pvti_tsc_shift = pvti->tsc_shift; > >> >> + pvti_system_time = pvti->system_time; > >> >> + pvti_tsc = pvti->tsc_timestamp; > >> >> + > >> >> + /* Make sure that the version double-check is last. */ > >> >> + smp_rmb(); > >> >> + } while (unlikely((version & 1) || version != pvti->version)); > >> >> + > >> >> + delta = tsc - pvti_tsc; > >> >> + ret = pvti_system_time + > >> >> + pvclock_scale_delta(delta, pvti_tsc_to_system_mul, > >> >> + pvti_tsc_shift); > >> > > >> > The following is possible: > >> > > >> > 1) State: all pvtis marked as PVCLOCK_TSC_STABLE_BIT. > >> > 1) Update request for all vcpus, for a TSC_STABLE_BIT -> ~TSC_STABLE_BIT > >> > transition. > >> > 2) vCPU-1 updates its pvti with new values. > >> > 3) vCPU-0 still has not updated its pvti with new values. > >> > 4) vCPU-1 VM-enters, uses vCPU-0 values, even though it has been > >> > notified of a TSC_STABLE_BIT -> ~TSC_STABLE_BIT transition. > >> > > >> > The update is not actually atomic across all vCPUs, its atomic in > >> > the sense of not allowing visibility of distinct > >> > system_timestamp/tsc_timestamp values. > >> > > >> > >> Hmm. In step 4, is there a guarantee that vCPU-0 won't VM-enter until > >> it gets marked unstable? > > > > Yes. It will VM-enter after pvti is updated. > > > >> Otherwise the vdso could could just as > >> easily be called from vCPU-1, migrated to vCPU-0, read the data > >> complete with stale stable bit, and get migrated back to vCPU-1. > > > > Right. > > > >> But I thought that KVM currently froze all vCPUs when updating pvti > >> for any of them. How can this happen? I admit I don't really > >> understand the update request code. > > > > The update is performed as follows: > > > > - Stop guest instruction execution on every vCPU, parking them in the host. > > - Request KVMCLOCK update for every vCPU. > > - Resume guest instruction execution. > > > > The KVMCLOCK update (==pvti update) is guaranteed to be performed before > > guest instructions are executed again. > > > > But there is no guarantee that vCPU-N has updated its pvti when > > vCPU-M resumes guest instruction execution. > > Still confused. So we can freeze all vCPUs in the host, then update > pvti 1, then resume vCPU 1, then update pvti 0? In that case, we have > a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM > doesn't increment the version pre-update, and we can return completely > bogus results. Yes. > > So the cost this patch removes is mainly from __getcpu (==RDTSCP?) ? > > It removes a whole bunch of code, an extra barrier, and two __getcpus. > > > Perhaps you can use Gleb's idea to stick vcpu id into version field ? > > I don't understand how that's useful at all. If you're reading pvti, > you clearly know the vcpu id. Replace the return value of __getcpus by value read from pvti.version. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html