Avi Kivity wrote:
> Marcelo Tosatti wrote:
>> I think the comparison is not entirely fair. You're using
>> KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that
>> (on Intel) to only one register read:
>>
>>         nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
>>
>> Whereas in a real hypercall for (say) PIO you would need the address,
>> size, direction and data.
>>
>
> Well, that's probably one of the reasons pio is slower, as the cpu has
> to set these up, and the kernel has to read them.

Right, that was the point I was trying to make.  It's a real-world
measurement of how long it takes KVM to go round trip in each of the
respective trap types.

>
>> Also for PIO/MMIO you're adding this unoptimized lookup to the
>> measurement:
>>
>>         pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
>>         if (pio_dev) {
>>                 kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
>>                 complete_pio(vcpu);
>>                 return 1;
>>         }
>>
>
> Since there are only one or two elements in the list, I don't see how
> it could be optimized.

To Marcelo's point, I think he was mainly taking exception to the fact
that the HC path could be optimized out completely if GCC were clever
enough about the switch(nr) statement hitting the null vector.  In
theory, both the io_bus lookup and the switch(nr) are about equivalent
in algorithmic complexity (and depth, I should say), which is why I
think the test is "fair" in general.  IOW, it represents the real-world
decode cycle for each transport.  However, if one side was artificially
optimized away simply due to the triviality of my NULLIO test, that is
not fair, and that is the point I believe he was making.

In any case, I just wrote a new version of the test which hopefully
addresses this by forcing GCC to leave the HC path as a more real-world
decode.  (FYI: I saw no difference.)  I will update the tarball/wiki
shortly.

>
>> Whereas for hypercall measurement you don't. I believe a fair comparison
>> would be to have a shared guest/host memory area where you store
>> guest/host TSC values and then do, on guest:
>>
>>         rdtscll(&shared_area->guest_tsc);
>>         pio/mmio/hypercall
>>         ... back to host
>>         rdtscll(&shared_area->host_tsc);
>>
>> And then calculate the difference (minus the guest's TSC_OFFSET of
>> course)?
>>
>
> I don't understand why you want host tsc?  We're interested in
> round-trip latency, so you want guest tsc all the time.

Yeah, I agree.  My take is that he was just trying to introduce a real
workload so GCC wouldn't do that potential "cheater decode" in the HC
path.  After thinking about it, however, I realized we could do the same
with a simple "state++" operation, so the new test does this in each
test's "execute" cycle.  The timing calculation remains unchanged.

-Greg
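
(To illustrate the "state++" idea: a stand-alone sketch, not the actual
test-harness code; every name below is made up.  In the real test the
increment happens on the host side of the trap, so the compiler cannot
see across it; in this toy demo the counter is marked volatile to get
the same effect.)

    #include <stdio.h>

    /* Hypothetical stand-in for the host-side "execute" step of a test
     * vector.  An empty case body gives the compiler license to collapse
     * the whole decode; incrementing a counter keeps a small amount of
     * real work in the measured path.  The volatile stands in for the
     * guest/host boundary the compiler cannot optimize across. */
    static volatile unsigned long state;

    static void execute(unsigned long nr)
    {
            switch (nr) {
            case 0:                 /* the "null" test vector */
                    state++;        /* trivial but real work */
                    break;
            default:
                    break;
            }
    }

    int main(void)
    {
            int i;

            for (i = 0; i < 1000; i++)
                    execute(0);
            printf("state = %lu\n", state);
            return 0;
    }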
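
(Likewise, a rough sketch of the kind of guest-side timing loop under
discussion.  Purely illustrative assumptions: "vmcall" is the Intel
encoding (AMD uses "vmmcall"), the vector number and iteration count
are made up, and the real test runs in the guest kernel rather than in
userspace.)

    #include <stdint.h>
    #include <stdio.h>

    static inline uint64_t rdtsc(void)
    {
            uint32_t lo, hi;

            asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
            return ((uint64_t)hi << 32) | lo;
    }

    /* KVM hypercall ABI: number in RAX, result returned in RAX. */
    static inline long null_hypercall(unsigned long nr)
    {
            long ret;

            asm volatile("vmcall" : "=a"(ret) : "a"(nr) : "memory");
            return ret;
    }

    int main(void)
    {
            const unsigned long iterations = 100000;
            uint64_t start, end;
            unsigned long i;

            start = rdtsc();
            for (i = 0; i < iterations; i++)
                    null_hypercall(0);      /* made-up vector number */
            end = rdtsc();

            printf("avg round trip: %llu cycles\n",
                   (unsigned long long)((end - start) / iterations));
            return 0;
    }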