Re: [PATCH] kvm: x86: disable KVM_FAST_MMIO_BUS

"Michael S. Tsirkin" <mst@xxxxxxxxxx> · Wed, 16 Aug 2017 17:06:53 +0300

On Wed, Aug 16, 2017 at 03:37:47PM +0200, Paolo Bonzini wrote:
> On 16/08/2017 14:07, Radim Krčmář wrote:
> > 2017-08-16 13:22+0200, Paolo Bonzini:
> >> Microsoft pointed out privately to me that KVM's handling of
> >> KVM_FAST_MMIO_BUS is invalid.  Using skip_emulation_instruction is invalid
> >> in EPT misconfiguration vmexit handlers, because neither EPT violations
> >> nor misconfigurations are listed in the manual among the VM exits that
> >> set the VM-exit instruction length field.
> >>
> >> While physical processors seem to set the field, this is not architectural
> >> and is just a side effect of the implementation.  I couldn't convince
> >> myself of any condition on the exit qualification where VM-exit
> >> instruction length "has" to be defined; there are no trap-like VM-exits
> >> that can be repurposed; and fault-like VM-exits such as descriptor-table
> >> exits provide no decoding information.  So I don't really see any elegant
> >> way to fix it except by disabling KVM_FAST_MMIO_BUS, which means virtio
> >> 1 will go slower.
> > 
> > Do you have some numbers?
> 
> Raw number from vmexit.flat on Haswell-EP:
> 
> mmio-no-eventfd:pci-mem 5793
> mmio-wildcard-eventfd:pci-mem 1395
> mmio-datamatch-eventfd:pci-mem 2268
> 
> So roughly 900 clock cycles.  Most of the work is the four memory reads
> done by x86_decode_insn, three to walk the page tables and one to fetch
> the instruction.
> 
> > We could keep the ugliness in KVM and add a new skip function with
> > emulate_instruction(vcpu, EMULTYPE_SKIP) to decode the length of the
> > instruction.  (Adding a condition just for EPT violation exit reason to
> > the existing skip function would be a dirtier solution.)
> > Slower than what we have now, but faster than full emulation.
> 
> This is actually a good idea, and not ugly at all!  The main cost is
> translating the physical address of the instruction and fetching the
> bytes, so only 200 clock cycles are saved.

We actually know what to expect (a write) so we could maybe
optimize this some more with a dedicated function just for this.

> 
> However, the eventfd is written before decoding, while full emulation
> would write it after. So while VCPU thread latency is worse compared to
> skip_emulated_instruction, latency to the iothread remains small.
> 
> Paolo