On Wed, Nov 29, 2017 at 04:42:16PM -0200, Eduardo Habkost wrote:
> On Wed, Nov 29, 2017 at 12:44:42PM +0100, Paolo Bonzini wrote:
> > On 29/11/2017 12:44, Eduardo Habkost wrote:
> > > On Mon, Nov 13, 2017 at 09:32:09AM +0100, Paolo Bonzini wrote:
> > >> On 13/11/2017 08:15, Wanpeng Li wrote:
> > >>> 2017-11-10 17:49 GMT+08:00 Paolo Bonzini <pbonzini@xxxxxxxxxx>:
> > >>>> Sometimes, a processor might execute an instruction while another
> > >>>> processor is updating the page tables for that instruction's code page,
> > >>>> but before the TLB shootdown completes. The interesting case happens
> > >>>> if the page is in the TLB.
> > >>>>
> > >>>> In general, the processor will succeed in executing the instruction and
> > >>>> nothing bad happens. However, what if the instruction is an MMIO access?
> > >>>> If *that* happens, KVM invokes the emulator, and the emulator gets the
> > >>>> updated page tables. If the update side had marked the code page as non
> > >>>> present, the page table walk then will fail and so will x86_decode_insn.
> > >>>>
> > >>>> Unfortunately, even though kvm_fetch_guest_virt is correctly returning
> > >>>> X86EMUL_PROPAGATE_FAULT, x86_decode_insn's caller treats the failure as
> > >>>> a fatal error if the instruction cannot simply be reexecuted (as is the
> > >>>> case for MMIO). And this in fact happened sometimes when rebooting
> > >>>> Windows 2012r2 guests. Just checking ctxt->have_exception and injecting
> > >>>> the exception if true is enough to fix the case.
> > >>>
> > >>> I found the only place which can set ctxt->have_exception is in the
> > >>> function x86_emulate_insn(), and x86_decode_insn() will not set
> > >>> ctxt->have_exception even if kvm_fetch_guest_virt() returns
> > >>> X86EMUL_PROPAGATE_FAULT.
> > >>
> > >> Hmm, you're right. Looks like Yanan has been (un)lucky when trying out
> > >> this patch! :(
> > >>
> > >> Yanan, can you double check that you can reproduce the issue with an
> > >> unpatched kernel? I will work on a kvm-unit-tests testcase.
> > >
> > > We don't have a kvm-unit-tests reproducer for this yet, right?
> > >
> > > I'm considering trying to write one, but I don't want to
> > > duplicate work.
> >
> > No, I haven't written one yet.
>
> The reproducer (not a full test case) is quite simple, see patch below.
>
> Now, I've noticed something interesting when running the
> reproducer:

There's something else that makes the bug hard to reproduce: as
soon as I set RSP to a valid address in inregs before calling
trap_emulator(), the bug is not reproducible anymore. But if I
keep RSP=0, I won't be able to validate the bug fix because I
won't be able to configure a working #PF handler.

This alone makes the bug not reproducible anymore:

diff --git a/x86/emulator.c b/x86/emulator.c
index 72cb035..a7e61ff 100644
--- a/x86/emulator.c
+++ b/x86/emulator.c
@@ -1104,6 +1104,8 @@ static void test_illegal_movbe(void)
 
 static void test_fetch_failure(void *mem, void *alt_insn_page)
 {
+	void *stack = alloc_page();
+	inregs = (struct regs){ .rsp = (u64)stack+1024 };
 	trap_emulator(mem, NULL, NULL);
 }
 

This is what I see:

When we don't have a stack (inregs.rsp=0), reexecute_instruction()
is preventing the emulation failure from happening on the I/O
instruction VM exits, and KVM keeps entering the VM in a loop
(getting thousands of I/O instruction VM exits) until we finally
get an EPT misconfig VM exit on GVA 0xfffffffffffffff8.

When we set up inregs.rsp, reexecute_instruction() also prevents
the emulation from failing on the I/O instruction VM exits, but
instead of an EPT misconfig VM exit, we get an EPT violation VM
exit after a few thousand iterations, and the page fault is
delivered to the VCPU.

I don't know why KVM loops so many times on I/O instruction VM
exits before finally getting an emulation failure (or finally
delivering a page fault, if a stack is available), but this might
explain why the bug is so hard to reproduce under normal
circumstances.
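For reference, two numbers in this thread can be checked by hand: the
GVA 0xfffffffffffffff8 mentioned above, and the "extra data" words in
the Suberror 3 dump quoted below. The following is only a sketch under
assumptions: that extra data[0] is the VMX IDT-vectoring information
word, extra data[1] the basic exit reason (printed in hex), and extra
data[3] the faulting guest-physical address, and that the failing
access is the first 8-byte stack push with RSP=0. The struct and
function names are made up for illustration; they are not part of KVM
or kvm-unit-tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Fields of the IDT-vectoring information word (Intel SDM layout). */
struct vectoring_info {
	bool valid;		/* bit 31 */
	bool has_error_code;	/* bit 11 */
	unsigned type;		/* bits 10:8, 3 = hardware exception */
	unsigned vector;	/* bits 7:0, 14 = #PF */
};

/* Hypothetical helper: split the raw word into its fields. */
static struct vectoring_info decode_vectoring_info(uint32_t raw)
{
	struct vectoring_info vi = {
		.valid = (raw >> 31) & 1,
		.has_error_code = (raw >> 11) & 1,
		.type = (raw >> 8) & 0x7,
		.vector = raw & 0xff,
	};
	return vi;
}

/* Address written by the first 8-byte push when the stack pointer is rsp. */
static uint64_t first_push_address(uint64_t rsp)
{
	return rsp - 8;	/* unsigned arithmetic wraps modulo 2^64 */
}

/* Checks against the values that appear in the dumps in this thread. */
static void check_dump_values(void)
{
	/* Assumption: extra data[0] is the IDT-vectoring information. */
	struct vectoring_info vi = decode_vectoring_info(0x80000b0e);

	assert(vi.valid && vi.has_error_code);
	assert(vi.type == 3);		/* hardware exception */
	assert(vi.vector == 14);	/* #PF */

	/* extra data[1] = 0x31 = 49, the EPT misconfiguration exit reason. */
	assert(0x31 == 49);

	/* With RSP=0, the first push lands on the GVA reported above. */
	assert(first_push_address(0) == 0xfffffffffffffff8ULL);

	/* Assumption: extra data[3] is a GPA; it has the same page offset. */
	assert((0xff000ff8ULL & 0xfff) == (first_push_address(0) & 0xfff));
}
```

If those assumptions hold, the Suberror 3 case reads as: a #PF was
being delivered, and the delivery itself hit an EPT misconfig exit
while touching the (nonexistent) stack at the top of the address space.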
>
> If the test_fetch_failure() call happens before we touch
> pci-testdev through *mem (like in the patch below), we get an
> emulation failure like the one Yanan saw:
>
> $ /usr/bin/qemu-system-x86_64 -nodefaults -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -machine accel=kvm -kernel ./x86/emulator.flat # -initrd /tmp/tmp.RCPjppRp8i
> enabling apic
> paging enabled
> cr0 = 80010011
> cr3 = 45e000
> cr4 = 20
> KVM internal error. Suberror: 1
> emulation failure
> RAX=0000000000000000 RBX=0000000000000000 RCX=0000000000000000 RDX=0000000000000000
> RSI=0000000000000000 RDI=0000000000000000 RBP=0000000000000000 RSP=0000000000000000
> R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
> RIP=ffffffffffffc08a RFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
> CS =0008 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
> SS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
> DS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
> FS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
> GS =0010 0000000000454d60 ffffffff 00c09300 DPL=0 DS [-WA]
> LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
> TR =0080 000000000041148a 0000ffff 00008b00 DPL=0 TSS64-busy
> GDT= 000000000041100a 0000047f
> IDT= 0000000000000000 00000fff
> CR0=80010011 CR2=ffffffffffffc08a CR3=000000000045e000 CR4=00000020
> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
> DR6=00000000ffff0ff0 DR7=0000000000000400
> EFER=0000000000000500
> Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
>
> but if I call test_fetch_failure() after touching *mem, like this:
>
> diff --git a/x86/emulator.c b/x86/emulator.c
> index 977ec75..72cb035 100644
> --- a/x86/emulator.c
> +++ b/x86/emulator.c
> @@ -1124,7 +1124,6 @@ int main()
>  	alt_insn_page = alloc_page();
>  	insn_ram = vmap(virt_to_phys(insn_page), 4096);
>  
> -	test_fetch_failure(mem, alt_insn_page);
>  
>  	// test mov reg, r/m and mov r/m, reg
>  	t1 = 0x123456789abcdef;
> @@ -1135,6 +1134,8 @@ int main()
>  		     : "memory");
>  	report("mov reg, r/m (1)", t2 == 0x123456789abcdef);
>  
> +	test_fetch_failure(mem, alt_insn_page);
> +
>  	test_simplealu(mem);
>  	test_cmps(mem);
>  	test_scas(mem);
>
> then I get a KVM_INTERNAL_ERROR_DELIVERY_EV:
>
> $ /usr/bin/qemu-system-x86_64 -nodefaults -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -machine accel=kvm -kernel ./x86/emulator.flat # -initrd /tmp/tmp.lmXZa46TEA
> enabling apic
> paging enabled
> cr0 = 80010011
> cr3 = 45e000
> cr4 = 20
> PASS: mov reg, r/m (1)
> KVM internal error. Suberror: 3
> extra data[0]: 80000b0e
> extra data[1]: 31
> extra data[2]: 182
> extra data[3]: ff000ff8
> RAX=0000000000000000 RBX=0000000000000000 RCX=0000000000000000 RDX=0000000000000000
> RSI=0000000000000000 RDI=0000000000000000 RBP=0000000000000000 RSP=0000000000000000
> R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
> RIP=ffffffffffffc08a RFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
> CS =0008 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
> SS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
> DS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
> FS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
> GS =0010 0000000000454d60 ffffffff 00c09300 DPL=0 DS [-WA]
> LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
> TR =0080 000000000041148a 0000ffff 00008b00 DPL=0 TSS64-busy
> GDT= 000000000041100a 0000047f
> IDT= 0000000000000000 00000fff
> CR0=80010011 CR2=ffffffffffffc08a CR3=000000000045e000 CR4=00000020
> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
> DR6=00000000ffff0ff0 DR7=0000000000000400
> EFER=0000000000000500
> Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
> ^C
>
> Also, if I run the reproducer using ept=0, it gets stuck into a
> loop re-entering the same "in (%dx),%al" instruction over and
> over again.
> trace-cmd report output:
>
> qemu-system-x86-18185 [001] 1057573.830491: kvm_exit: reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
> qemu-system-x86-18185 [001] 1057573.830494: kvm_emulate_insn: 0:ffffffffffffc08a: 4d 89 2c 24
> qemu-system-x86-18185 [001] 1057573.830503: kvm_entry: vcpu 0
> qemu-system-x86-18185 [001] 1057573.830504: kvm_exit: reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
> qemu-system-x86-18185 [001] 1057573.830505: kvm_emulate_insn: 0:ffffffffffffc08a: 4d 89 2c 24
> qemu-system-x86-18185 [001] 1057573.830506: kvm_entry: vcpu 0
> qemu-system-x86-18185 [001] 1057573.830507: kvm_exit: reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
> qemu-system-x86-18185 [001] 1057573.830508: kvm_emulate_insn: 0:ffffffffffffc08a: 4d 89 2c 24
> qemu-system-x86-18185 [001] 1057573.830509: kvm_entry: vcpu 0
> qemu-system-x86-18185 [001] 1057573.830510: kvm_exit: reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
> qemu-system-x86-18185 [001] 1057573.830511: kvm_emulate_insn: 0:ffffffffffffc08a: 4d 89 2c 24
> qemu-system-x86-18185 [001] 1057573.830511: kvm_entry: vcpu 0
> qemu-system-x86-18185 [001] 1057573.830512: kvm_exit: reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
> qemu-system-x86-18185 [001] 1057573.830513: kvm_emulate_insn: 0:ffffffffffffc08a: 4d 89 2c 24
> qemu-system-x86-18185 [001] 1057573.830514: kvm_entry: vcpu 0
> qemu-system-x86-18185 [001] 1057573.830514: kvm_exit: reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
> qemu-system-x86-18185 [001] 1057573.830515: kvm_emulate_insn: 0:ffffffffffffc08a: 4d 89 2c 24
> qemu-system-x86-18185 [001] 1057573.830516: kvm_entry: vcpu 0
> qemu-system-x86-18185 [001] 1057573.830517: kvm_exit: reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
> qemu-system-x86-18185 [001] 1057573.830518: kvm_emulate_insn: 0:ffffffffffffc08a: 4d 89 2c 24
> qemu-system-x86-18185 [001] 1057573.830519: kvm_entry: vcpu 0
> qemu-system-x86-18185 [001] 1057573.830521: kvm_exit: reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
> qemu-system-x86-18185 [001] 1057573.830522: kvm_emulate_insn: 0:ffffffffffffc08a: 4d 89 2c 24
> qemu-system-x86-18185 [001] 1057573.830523: kvm_entry: vcpu 0
> [...]
>
> Signed-off-by: Eduardo Habkost <ehabkost@xxxxxxxxxx>
> ---
>  x86/emulator.c | 21 +++++++++++++++++----
>  1 file changed, 17 insertions(+), 4 deletions(-)
>
> diff --git a/x86/emulator.c b/x86/emulator.c
> index e6f27cc..977ec75 100644
> --- a/x86/emulator.c
> +++ b/x86/emulator.c
> @@ -792,9 +792,11 @@ static void trap_emulator(uint64_t *mem, void *alt_insn_page,
>  	extern u8 insn_page[], test_insn[];
>  
>  	insn_ram = vmap(virt_to_phys(insn_page), 4096);
> -	memcpy(alt_insn_page, insn_page, 4096);
> -	memcpy(alt_insn_page + (test_insn - insn_page),
> -			(void *)(alt_insn->ptr), alt_insn->len);
> +	if (alt_insn_page) {
> +		memcpy(alt_insn_page, insn_page, 4096);
> +		memcpy(alt_insn_page + (test_insn - insn_page),
> +				(void *)(alt_insn->ptr), alt_insn->len);
> +	}
>  	save = inregs;
>  
>  	/* Load the code TLB with insn_page, but point the page tables at
> @@ -805,7 +807,11 @@ static void trap_emulator(uint64_t *mem, void *alt_insn_page,
>  	invlpg(insn_ram);
>  	/* Load code TLB */
>  	asm volatile("call *%0" : : "r"(insn_ram));
> -	install_page(cr3, virt_to_phys(alt_insn_page), insn_ram);
> +	if (alt_insn_page) {
> +		install_page(cr3, virt_to_phys(alt_insn_page), insn_ram);
> +	} else {
> +		install_pte(cr3, 1, insn_ram, PT_USER_MASK, 0);
> +	}
>  	/* Trap, let hypervisor emulate at alt_insn_page */
>  	asm volatile("call *%0": : "r"(insn_ram+1));
>  
> @@ -1096,6 +1102,11 @@ static void test_illegal_movbe(void)
>  	handle_exception(UD_VECTOR, 0);
>  }
>  
> +static void test_fetch_failure(void *mem, void *alt_insn_page)
> +{
> +	trap_emulator(mem, NULL, NULL);
> +}
> +
>  int main()
>  {
>  	void *mem;
> @@ -1113,6 +1124,8 @@ int main()
>  	alt_insn_page = alloc_page();
>  	insn_ram = vmap(virt_to_phys(insn_page), 4096);
>  
> +	test_fetch_failure(mem, alt_insn_page);
> +
>  	// test mov reg, r/m and mov r/m, reg
>  	t1 = 0x123456789abcdef;
>  	asm volatile("mov %[t1], (%[mem]) \n\t"
> -- 
> 2.13.6
>
>
> -- 
> Eduardo

-- 
Eduardo