On Wed, Aug 16, 2017 at 03:37:47PM +0200, Paolo Bonzini wrote:
> On 16/08/2017 14:07, Radim Krčmář wrote:
> > 2017-08-16 13:22+0200, Paolo Bonzini:
> >> Microsoft pointed out privately to me that KVM's handling of
> >> KVM_FAST_MMIO_BUS is invalid. Using skip_emulation_instruction is invalid
> >> in EPT misconfiguration vmexit handlers, because neither EPT violations
> >> nor misconfigurations are listed in the manual among the VM exits that
> >> set the VM-exit instruction length field.
> >>
> >> While physical processors seem to set the field, this is not architectural
> >> and is just a side effect of the implementation. I couldn't convince
> >> myself of any condition on the exit qualification where VM-exit
> >> instruction length "has" to be defined; there are no trap-like VM-exits
> >> that can be repurposed; and fault-like VM-exits such as descriptor-table
> >> exits provide no decoding information. So I don't really see any elegant
> >> way to fix it except by disabling KVM_FAST_MMIO_BUS, which means virtio
> >> 1 will go slower.
> >
> > Do you have some numbers?
>
> Raw number from vmexit.flat on Haswell-EP:
>
> mmio-no-eventfd:pci-mem 5793
> mmio-wildcard-eventfd:pci-mem 1395
> mmio-datamatch-eventfd:pci-mem 2268
>
> So roughly 900 clock cycles. Most of the work is the four memory reads
> done by x86_decode_insn, three to walk the page tables and one to fetch
> the instruction.
>
> > We could keep the ugliness in KVM and add a new skip function with
> > emulate_instruction(vcpu, EMULTYPE_SKIP) to decode the length of the
> > instruction. (Adding a condition just for EPT violation exit reason to
> > the existing skip function would be a dirtier solution.)
> > Slower than what we have now, but faster than full emulation.
>
> This is actually a good idea, and not ugly at all! The main cost is
> translating the physical address of the instruction and fetching the
> bytes, so only 200 clock cycles are saved.

We actually know what to expect (a write), so we could maybe optimize
this some more with a dedicated function.

> However, the eventfd is written before decoding, while full emulation
> would write it after. So while VCPU thread latency is worse compared to
> skip_emulated_instruction, latency to the iothread remains small.
>
> Paolo
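
To make the ordering point in the last quoted paragraph concrete, here is a
minimal sketch in Python of the two sequences. It is a toy model, not KVM
code: every function name and trace string below is hypothetical, and
decode_and_skip only stands in for the emulate_instruction(vcpu,
EMULTYPE_SKIP) call suggested earlier in the thread.

```python
# Toy model of the ordering discussed above; not KVM code.
# All names here are hypothetical stand-ins chosen for illustration.


def signal_eventfd(trace):
    # Stand-in for the write reaching the eventfd, which wakes the iothread.
    trace.append("eventfd signalled")


def decode_and_skip(trace):
    # Stand-in for decoding the instruction length and advancing past it.
    trace.append("instruction decoded and skipped")


def fast_mmio_with_skip_decode():
    """Proposed path: signal first, then pay the decode cost on the vCPU thread."""
    trace = []
    signal_eventfd(trace)
    decode_and_skip(trace)
    return trace


def full_emulation():
    """Full emulation: the instruction is decoded before the write is delivered."""
    trace = []
    decode_and_skip(trace)
    signal_eventfd(trace)
    return trace


if __name__ == "__main__":
    print("skip-decode path:", fast_mmio_with_skip_decode())
    print("full emulation:  ", full_emulation())
```

In this model the decode step sits after the signal on the proposed path, so
it only delays the vCPU thread's return to the guest, not the iothread wakeup.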