Re: [RFC 1/1] KVM: selftests: rseq_test: use vdso_getcpu() instead of syscall()

On Thu, 2022-11-03 at 00:46 +0000, Sean Christopherson wrote:
> On Wed, Nov 02, 2022, Robert Hoo wrote:
> > vDSO getcpu() has been in the kernel since 2.6.19, so we can assume
> > it is generally available.
> > Use vDSO getcpu() to reduce overhead, so that the vcpu thread stalls
> > less and therefore has better odds of hitting the race condition.
> > 
> > Fixes: 0fcc102923de ("KVM: selftests: Use getcpu() instead of sched_getcpu() in rseq_test")
> > Signed-off-by: Robert Hoo <robert.hu@xxxxxxxxxxxxxxx>
> > ---
> 
> ...
> 
> > @@ -253,7 +269,7 @@ int main(int argc, char *argv[])
> >  			 * across the seq_cnt reads.
> >  			 */
> >  			smp_rmb();
> > -			sys_getcpu(&cpu);
> > +			vdso_getcpu(&cpu, NULL, NULL);
> >  			rseq_cpu = rseq_current_cpu_raw();
> >  			smp_rmb();
> >  		} while (snapshot != atomic_read(&seq_cnt));
> 
> Something seems off here.  Half of the iterations in the migration
> thread have a delay of 5+us, which should be more than enough time
> to complete a few getcpu() syscalls to stabilize the CPU.
> 
The migration thread's delay applies to the whole vcpu thread loop,
not just vcpu_run(), I think:

	for (i = 0; !done; i++) {
		vcpu_run(vcpu);
		TEST_ASSERT(get_ucall(vcpu, NULL) == UCALL_SYNC,
			    "Guest failed?");
...
		do {
			...
			vdso_getcpu(&cpu, NULL, NULL);
			rseq_cpu = rseq_current_cpu_raw();
			...
		} while (snapshot != atomic_read(&seq_cnt));

...
	}
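
FWIW, to see where the loop time actually goes, crude per-call timing
works; here is a rough measurement sketch (my own instrumentation
idea, not part of the patch; t0/run_ns/ucall_ns are made-up names):

	#include <stdint.h>
	#include <time.h>

	static uint64_t now_ns(void)
	{
		struct timespec ts;

		clock_gettime(CLOCK_MONOTONIC, &ts);
		return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
	}

	/* Declared once before the loop: */
	uint64_t t0, run_ns = 0, ucall_ns = 0;

	/* In the loop body, bracket each sub-call: */
	t0 = now_ns();
	vcpu_run(vcpu);
	run_ns += now_ns() - t0;	/* accumulated vcpu_run() cost */

	t0 = now_ns();
	TEST_ASSERT(get_ucall(vcpu, NULL) == UCALL_SYNC, "Guest failed?");
	ucall_ns += now_ns() - t0;	/* accumulated get_ucall() cost */

Dumping those totals at the end would show which sub-call is eating
the extra time.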

> Has anyone tried to figure out why the vCPU thread is apparently
> running slow?  E.g. is KVM_RUN itself taking a long time, is the
> task not getting scheduled in, etc...  I can see how using vDSO
> would make the vCPU more efficient, but I'm curious as to why
> that's a problem in the first place.

Yes, that underlying slowness is the first-order problem.
But to start with, it's the whole for(){} loop that takes more time
than before, and that increase can be attributed to the key
sub-calls, e.g. vcpu_run(), get_ucall(), getcpu() and
rseq_current_cpu_raw().

Though vcpu_run() deserves the first look, reducing the time spent in
the others also helps.
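
For getcpu() in particular, that is what going through the vDSO buys
us. For reference, a minimal sketch of how such a helper can be
resolved (my illustration, assuming x86-64 where the exported symbol
is __vdso_getcpu and a glibc whose dlopen() can find the
already-mapped "linux-vdso.so.1"; not necessarily how the patch wires
it up):

	#include <dlfcn.h>
	#include <stdio.h>

	typedef long (*getcpu_fn)(unsigned int *cpu, unsigned int *node,
				  void *tcache);
	static getcpu_fn vdso_getcpu;

	static void init_vdso_getcpu(void)
	{
		/* RTLD_NOLOAD: only succeed if the vDSO is already mapped. */
		void *vdso = dlopen("linux-vdso.so.1",
				    RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);

		if (!vdso) {
			fprintf(stderr, "no vDSO, keep the syscall path\n");
			return;
		}
		vdso_getcpu = (getcpu_fn)dlsym(vdso, "__vdso_getcpu");
	}

Calling through that pointer skips the user/kernel round trip
entirely, which is the point of the change.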

BTW, I find that x86's get_ucall() does one more vcpu ioctl
(vcpu_regs_get()) than aarch64's, which perhaps partly explains why
the for(){} loop is heavier on x86 than on aarch64:

uint64_t get_ucall(struct kvm_vcpu *vcpu, struct ucall *uc)
@@ -43,12 +95,14 @@
 	if (uc)
 		memset(uc, 0, sizeof(*uc));
 
-	if (run->exit_reason == KVM_EXIT_IO && run->io.port == UCALL_PIO_PORT) {
-		struct kvm_regs regs;
-
-		vcpu_regs_get(vcpu, &regs);
-		memcpy(&ucall, addr_gva2hva(vcpu->vm, (vm_vaddr_t)regs.rdi),
-		       sizeof(ucall));
+	if (run->exit_reason == KVM_EXIT_MMIO &&
+	    run->mmio.phys_addr == (uint64_t)ucall_exit_mmio_addr) {
+		vm_vaddr_t gva;
+
+		TEST_ASSERT(run->mmio.is_write && run->mmio.len == 8,
+			    "Unexpected ucall exit mmio address access");
+		memcpy(&gva, run->mmio.data, sizeof(gva));
+		memcpy(&ucall, addr_gva2hva(vcpu->vm, gva), sizeof(ucall));
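
I.e. with the MMIO exit the ucall GVA is already carried in
run->mmio.data, while the x86 PIO exit has to issue an extra
KVM_GET_REGS ioctl (vcpu_regs_get()) to fetch regs.rdi, one more
user/kernel round trip per loop iteration.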

> 
> Anyways, assuming there's no underlying problem that can be solved,
> the easier solution is to just bump the delay in the migration
> thread.  As per its gigantic comment, the original bug reproduced
> with up to 500us delays, so bumping the min delay to e.g. 5us is
> acceptable.  If that doesn't guarantee the vCPU meets its quota,
> then something else is definitely going on.



