Hi all,

On 7/10/19 10:31 PM, Jan Kiszka wrote:
> On 10.07.19 18:05, Jan Kiszka wrote:
>> Hi KarimAllah,
>>
>> On 10.07.19 17:24, Raslan, KarimAllah wrote:
>>> On Mon, 2019-07-08 at 22:39 +0200, Jan Kiszka wrote:
>>>> Hi all,
>>>>
>>>> it seems the "new" KVM_SET_NESTED_STATE interface has some remaining
>>>> robustness issues.
>>>
>>> I would be very interested to learn about any more robustness issues that you
>>> are seeing.
>>>
>>>> The most urgent one: With the help of latest QEMU
>>>> master that uses this interface, you can easily crash the host. You just
>>>> need to start qemu-system-x86 -enable-kvm in L1 and then hard-reset L1.
>>>> The host CPU that ran this will stall, the system will freeze soon.
>>>
>>> Just to confirm, you start an L2 guest using qemu inside an L1-guest and then
>>> hard-reset the L1 guest?
>>
>> Exactly.
>>
>>>
>>> Are you running any special workload in L2 or L1 when you reset? Also how
>>
>> Nope. It is a standard (though rather oldish) userland in L1, just running a
>> more recent kernel 5.2.
>>
>>> exactly are you doing this "hard reset"?
>>
>> system_reset from the monitor or "reset" from QEMU window menu.

While I'm not able to reproduce this behaviour on any of my machines
(i7-4810MQ, i7-5600U, Xeon Gold 5118),

>>
>>>
>>> (sorry just tried this in my setup and I did not see any problem but my setup
>>> is slightly different, so just ruling out obvious stuff).
>>>
>>
>> If it helps, I can share privately a guest image that was built via
>> https://github.com/siemens/jailhouse-images which exposes the reset issue after
>> starting Jailhouse (instead of qemu-system-x86_64 - though that should "work" as
>> well, just not tested yet). It's about 70M packed.
>>
>> Host-wise, 5.2.0 + QEMU master should do. I can also provide you the .config if
>> needed.

I can reproduce and confirm this issue. A system_reset of qemu after Jailhouse
is enabled leads to the crash listed below, on all machines.

On the Xeon Gold, e.g., Qemu reports:

EAX=00000000 EBX=00000000 ECX=00000000 EDX=00000f61
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=0000fff0 EFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 0000ffff 00009300
CS =f000 ffff0000 0000ffff 00a09b00
SS =0000 00000000 0000ffff 00c09300
DS =0000 00000000 0000ffff 00009300
FS =0000 00000000 0000ffff 00009300
GS =0000 00000000 0000ffff 00009300
LDT=0000 00000000 0000ffff 00008200
TR =0000 00000000 0000ffff 00008b00
GDT=     00000000 0000ffff
IDT=     00000000 0000ffff
CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000680
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=00 66 89 d8 66 e8 af a1 ff ff 66 83 c4 0c 66 5b 66 5e 66 c3 <ea> 5b e0 00 f0 30 36 2f 32 33 2f 39 39 00 fc 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Kernel:
[ 1868.804515] kvm: vmptrld (null)/6b8640000000 failed
[ 1868.804568] kvm: vmclear fail: (null)/6b8640000000

And the host freezes unrecoverably. Hosts use standard distro kernels >= v5.0.

  Ralf

>>
>>>>
>>>> I've also seen a pattern with my Jailhouse test VM where I seems to get
>>>> stuck in a loop between L1 and L2:
>>>>
>>>> qemu-system-x86-6660 [007] 398.691401: kvm_nested_vmexit: rip 7fa9ee5224e4 reason IO_INSTRUCTION info1 5658000b info2 0 int_info 0 int_info_err 0
>>>> qemu-system-x86-6660 [007] 398.691402: kvm_fpu: unload
>>>> qemu-system-x86-6660 [007] 398.691403: kvm_userspace_exit: reason KVM_EXIT_IO (2)
>>>> qemu-system-x86-6660 [007] 398.691440: kvm_fpu: load
>>>> qemu-system-x86-6660 [007] 398.691441: kvm_pio: pio_read at 0x5658 size 4 count 1 val 0x4
>>>> qemu-system-x86-6660 [007] 398.691443: kvm_mmu_get_page: existing sp gfn 3a22e 1/4 q3 direct --x !pge !nxe root 6 sync
>>>> qemu-system-x86-6660 [007] 398.691444: kvm_entry: vcpu 3
>>>> qemu-system-x86-6660 [007] 398.691475: kvm_exit: reason IO_INSTRUCTION rip 0x7fa9ee5224e4 info 5658000b 0
>>>> qemu-system-x86-6660 [007] 398.691476: kvm_nested_vmexit: rip 7fa9ee5224e4 reason IO_INSTRUCTION info1 5658000b info2 0 int_info 0 int_info_err 0
>>>> qemu-system-x86-6660 [007] 398.691477: kvm_fpu: unload
>>>> qemu-system-x86-6660 [007] 398.691478: kvm_userspace_exit: reason KVM_EXIT_IO (2)
>>>> qemu-system-x86-6660 [007] 398.691526: kvm_fpu: load
>>>> qemu-system-x86-6660 [007] 398.691527: kvm_pio: pio_read at 0x5658 size 4 count 1 val 0x4
>>>> qemu-system-x86-6660 [007] 398.691529: kvm_mmu_get_page: existing sp gfn 3a22e 1/4 q3 direct --x !pge !nxe root 6 sync
>>>> qemu-system-x86-6660 [007] 398.691530: kvm_entry: vcpu 3
>>>> qemu-system-x86-6660 [007] 398.691533: kvm_exit: reason IO_INSTRUCTION rip 0x7fa9ee5224e4 info 5658000b 0
>>>> qemu-system-x86-6660 [007] 398.691534: kvm_nested_vmexit: rip 7fa9ee5224e4 reason IO_INSTRUCTION info1 5658000b info2 0 int_info 0 int_info_err 0
>>>>
>>>> These issues disappear when going from ebbfef2f back to 6cfd7639 (both
>>>> with build fixes) in QEMU.
>>>
>>> This is the QEMU that you are using in L0 to launch an L1 guest, right? or are
>>> you still referring to the QEMU mentioned above?
>>
>> This scenario is similar but still a bit different than the above. Yes, same L0
>> image and host QEMU here (and the traces were taken on the host, obviously), but
>> the workload is now as follows:
>>
>> - boot L1 Linux
>> - enable Jailhouse inside L1
>> - move the mouse over the graphical desktop of L2, ie. the former L1
>>   Linux (Jailhouse is now L1)
>> - the L1/L2 guests enter the loop above while trying to read from the
>>   vmmouse port
>>
>> Jan
>>
>
> Ralf tried my case on some of his systems as well but he also didn't succeed in
> reproducing. So we compared vmxcap lists because I'm starting to think it's
> feature-related. There are some differences...
>
> --- vmxcap.i7-5600u 2019-07-10 21:59:05.616547924 +0200
> +++ vmxcap.jan 2019-07-10 21:58:23.135686409 +0200
> @@ -1,6 +1,6 @@
>  Basic VMX Information
> -  Hex:                                          0xda040000000012
> -  Revision                                      18
> +  Hex:                                          0xda040000000004
> +  Revision                                      4
>    VMCS size                                     1024
>    VMCS restricted to 32 bit addresses           no
>    Dual-monitor support                          yes
> @@ -51,13 +51,13 @@
>    Enable INVPCID                                yes
>    Enable VM functions                           yes
>    VMCS shadowing                                yes
> -  Enable ENCLS exiting                          no
> +  Enable ENCLS exiting                          yes
>    RDSEED exiting                                yes
> -  Enable PML                                    no
> +  Enable PML                                    yes
>    EPT-violation #VE                             yes
> -  Conceal non-root operation from PT            no
> -  Enable XSAVES/XRSTORS                         no
> -  Mode-based execute control (XS/XU)            no
> +  Conceal non-root operation from PT            yes
> +  Enable XSAVES/XRSTORS                         yes
> +  Mode-based execute control (XS/XU)            yes
>    TSC scaling                                   no
>  VM-Exit controls
>    Save debug controls                           default
> @@ -69,8 +69,8 @@
>    Save IA32_EFER                                yes
>    Load IA32_EFER                                yes
>    Save VMX-preemption timer value               yes
> -  Clear IA32_BNDCFGS                            no
> -  Conceal VM exits from PT                      no
> +  Clear IA32_BNDCFGS                            yes
> +  Conceal VM exits from PT                      yes
>  VM-Entry controls
>    Load debug controls                           default
>    IA-32e mode guest                             yes
> @@ -79,11 +79,11 @@
>    Load IA32_PERF_GLOBAL_CTRL                    yes
>    Load IA32_PAT                                 yes
>    Load IA32_EFER                                yes
> -  Load IA32_BNDCFGS                             no
> -  Conceal VM entries from PT                    no
> +  Load IA32_BNDCFGS                             yes
> +  Conceal VM entries from PT                    yes
>  Miscellaneous data
> -  Hex:                                          0x300481e5
> -  VMX-preemption timer scale (log2)             5
> +  Hex:                                          0x7004c1e7
> +  VMX-preemption timer scale (log2)             7
>    Store EFER.LMA into IA-32e mode guest control yes
>    HLT activity state                            yes
>    Shutdown activity state                       yes
> @@ -93,10 +93,10 @@
>    MSR-load/store count recommendation           0
>    IA32_SMM_MONITOR_CTL[2] can be set to 1       yes
>    VMWRITE to VM-exit information fields         yes
> -  Inject event with insn length=0               no
> +  Inject event with insn length=0               yes
>    MSEG revision identifier                      0
>  VPID and EPT capabilities
> -  Hex:                                          0xf0106334141
> +  Hex:                                          0xf0106734141
>    Execute-only EPT translations                 yes
>    Page-walk length 4                            yes
>    Paging-structure memory type UC               yes
>
> And another machine that does not crash:
>
> --- vmxcaps.e5-2683v4 2019-07-10 22:21:28.620329384 +0200
> +++ vmxcap.jan 2019-07-10 21:58:23.135686409 +0200
> @@ -1,6 +1,6 @@
>  Basic VMX Information
> -  Hex:                                          0xda040000000012
> -  Revision                                      18
> +  Hex:                                          0xda040000000004
> +  Revision                                      4
>    VMCS size                                     1024
>    VMCS restricted to 32 bit addresses           no
>    Dual-monitor support                          yes
> @@ -12,7 +12,7 @@
>    NMI exiting                                   yes
>    Virtual NMIs                                  yes
>    Activate VMX-preemption timer                 yes
> -  Process posted interrupts                     yes
> +  Process posted interrupts                     no
>  primary processor-based controls
>    Interrupt window exiting                      yes
>    Use TSC offsetting                            yes
> @@ -44,20 +44,20 @@
>    Enable VPID                                   yes
>    WBINVD exiting                                yes
>    Unrestricted guest                            yes
> -  APIC register emulation                       yes
> -  Virtual interrupt delivery                    yes
> +  APIC register emulation                       no
> +  Virtual interrupt delivery                    no
>    PAUSE-loop exiting                            yes
>    RDRAND exiting                                yes
>    Enable INVPCID                                yes
>    Enable VM functions                           yes
>    VMCS shadowing                                yes
> -  Enable ENCLS exiting                          no
> +  Enable ENCLS exiting                          yes
>    RDSEED exiting                                yes
>    Enable PML                                    yes
>    EPT-violation #VE                             yes
> -  Conceal non-root operation from PT            no
> -  Enable XSAVES/XRSTORS                         no
> -  Mode-based execute control (XS/XU)            no
> +  Conceal non-root operation from PT            yes
> +  Enable XSAVES/XRSTORS                         yes
> +  Mode-based execute control (XS/XU)            yes
>    TSC scaling                                   no
>  VM-Exit controls
>    Save debug controls                           default
> @@ -69,8 +69,8 @@
>    Save IA32_EFER                                yes
>    Load IA32_EFER                                yes
>    Save VMX-preemption timer value               yes
> -  Clear IA32_BNDCFGS                            no
> -  Conceal VM exits from PT                      no
> +  Clear IA32_BNDCFGS                            yes
> +  Conceal VM exits from PT                      yes
>  VM-Entry controls
>    Load debug controls                           default
>    IA-32e mode guest                             yes
> @@ -79,11 +79,11 @@
>    Load IA32_PERF_GLOBAL_CTRL                    yes
>    Load IA32_PAT                                 yes
>    Load IA32_EFER                                yes
> -  Load IA32_BNDCFGS                             no
> -  Conceal VM entries from PT                    no
> +  Load IA32_BNDCFGS                             yes
> +  Conceal VM entries from PT                    yes
>  Miscellaneous data
> -  Hex:                                          0x300481e5
> -  VMX-preemption timer scale (log2)             5
> +  Hex:                                          0x7004c1e7
> +  VMX-preemption timer scale (log2)             7
>    Store EFER.LMA into IA-32e mode guest control yes
>    HLT activity state                            yes
>    Shutdown activity state                       yes
> @@ -93,10 +93,10 @@
>    MSR-load/store count recommendation           0
>    IA32_SMM_MONITOR_CTL[2] can be set to 1       yes
>    VMWRITE to VM-exit information fields         yes
> -  Inject event with insn length=0               no
> +  Inject event with insn length=0               yes
>    MSEG revision identifier                      0
>  VPID and EPT capabilities
> -  Hex:                                          0xf0106334141
> +  Hex:                                          0xf0106734141
>    Execute-only EPT translations                 yes
>    Page-walk length 4                            yes
>    Paging-structure memory type UC               yes
>
> And on a Xeon D-1540, I'm not seeing a crash but a kvm entry failure when
> resetting L1 while running Jailhouse:
>
> KVM: entry failed, hardware error 0x7
> EAX=00000000 EBX=00000000 ECX=00000000 EDX=00000f61
> ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
> EIP=0000fff0 EFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0000 00000000 0000ffff 00009300
> CS =f000 ffff0000 0000ffff 00a09b00
> SS =0000 00000000 0000ffff 00c09300
> DS =0000 00000000 0000ffff 00009300
> FS =0000 00000000 0000ffff 00009300
> GS =0000 00000000 0000ffff 00009300
> LDT=0000 00000000 0000ffff 00008200
> TR =0000 00000000 0000ffff 00008b00
> GDT=     00000000 0000ffff
> IDT=     00000000 0000ffff
> CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000680
> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
> DR6=00000000ffff0ff0 DR7=0000000000000400
> EFER=0000000000000000
> Code=00 66 89 d8 66 e8 af a1 ff ff 66 83 c4 0c 66 5b 66 5e 66 c3 <ea> 5b e0 00
> f0 30 36 2f 32 33 2f 39 39 00 fc 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>
> Here is the vmxcap diff:
>
> --- xeon-d 2019-07-10 22:29:56.735374032 +0200
> +++ i7-8850H 2019-07-10 22:29:31.747467248 +0200
> @@ -1,6 +1,6 @@
>  Basic VMX Information
> -  Hex:                                          0xda040000000012
> -  Revision                                      18
> +  Hex:                                          0xda040000000004
> +  Revision                                      4
>    VMCS size                                     1024
>    VMCS restricted to 32 bit addresses           no
>    Dual-monitor support                          yes
> @@ -12,7 +12,7 @@ pin-based controls
>    NMI exiting                                   yes
>    Virtual NMIs                                  yes
>    Activate VMX-preemption timer                 yes
> -  Process posted interrupts                     yes
> +  Process posted interrupts                     no
>  primary processor-based controls
>    Interrupt window exiting                      yes
>    Use TSC offsetting                            yes
> @@ -44,20 +44,20 @@ secondary processor-based controls
>    Enable VPID                                   yes
>    WBINVD exiting                                yes
>    Unrestricted guest                            yes
> -  APIC register emulation                       yes
> -  Virtual interrupt delivery                    yes
> +  APIC register emulation                       no
> +  Virtual interrupt delivery                    no
>    PAUSE-loop exiting                            yes
>    RDRAND exiting                                yes
>    Enable INVPCID                                yes
>    Enable VM functions                           yes
>    VMCS shadowing                                yes
> -  Enable ENCLS exiting                          no
> +  Enable ENCLS exiting                          yes
>    RDSEED exiting                                yes
>    Enable PML                                    yes
>    EPT-violation #VE                             yes
> -  Conceal non-root operation from PT            no
> -  Enable XSAVES/XRSTORS                         no
> -  Mode-based execute control (XS/XU)            no
> +  Conceal non-root operation from PT            yes
> +  Enable XSAVES/XRSTORS                         yes
> +  Mode-based execute control (XS/XU)            yes
>    TSC scaling                                   no
>  VM-Exit controls
>    Save debug controls                           default
> @@ -69,8 +69,8 @@ VM-Exit controls
>    Save IA32_EFER                                yes
>    Load IA32_EFER                                yes
>    Save VMX-preemption timer value               yes
> -  Clear IA32_BNDCFGS                            no
> -  Conceal VM exits from PT                      no
> +  Clear IA32_BNDCFGS                            yes
> +  Conceal VM exits from PT                      yes
>  VM-Entry controls
>    Load debug controls                           default
>    IA-32e mode guest                             yes
> @@ -79,11 +79,11 @@ VM-Entry controls
>    Load IA32_PERF_GLOBAL_CTRL                    yes
>    Load IA32_PAT                                 yes
>    Load IA32_EFER                                yes
> -  Load IA32_BNDCFGS                             no
> -  Conceal VM entries from PT                    no
> +  Load IA32_BNDCFGS                             yes
> +  Conceal VM entries from PT                    yes
>  Miscellaneous data
> -  Hex:                                          0x300481e5
> -  VMX-preemption timer scale (log2)             5
> +  Hex:                                          0x7004c1e7
> +  VMX-preemption timer scale (log2)             7
>    Store EFER.LMA into IA-32e mode guest control yes
>    HLT activity state                            yes
>    Shutdown activity state                       yes
> @@ -93,10 +93,10 @@ Miscellaneous data
>    MSR-load/store count recommendation           0
>    IA32_SMM_MONITOR_CTL[2] can be set to 1       yes
>    VMWRITE to VM-exit information fields         yes
> -  Inject event with insn length=0               no
> +  Inject event with insn length=0               yes
>    MSEG revision identifier                      0
>  VPID and EPT capabilities
> -  Hex:                                          0xf0106334141
> +  Hex:                                          0xf0106734141
>    Execute-only EPT translations                 yes
>    Page-walk length 4                            yes
>    Paging-structure memory type UC               yes
>
> Maybe the KVM code does not take the latest VMX features into account when
> importing a userspace nested state?
>
> Jan
>
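
A side note on reading the vmxcap diffs above: the raw "Hex:" words can also be
compared bit by bit, which makes it easier to see exactly which capability bits
differ between two machines. The following is only a minimal sketch. The helper
names are made up for this mail and are not part of vmxcap or KVM, and the
field positions (revision in bits 30:0 and VMCS size in bits 44:32 of the basic
information word, preemption timer scale in bits 4:0 of the miscellaneous data
word) are assumptions taken from my reading of the Intel SDM, so please
double-check them before relying on the output.

```python
# Hypothetical helpers for comparing the raw "Hex:" words printed by vmxcap.
# Bit-field positions are assumptions based on the Intel SDM, not on vmxcap.

def differing_bits(a, b):
    """Return the bit positions (LSB = 0) at which a and b differ."""
    diff = a ^ b
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


def decode_basic(word):
    """Split the basic VMX information word into revision and VMCS size."""
    return {
        "revision": word & 0x7FFFFFFF,       # assumed: bits 30:0
        "vmcs_size": (word >> 32) & 0x1FFF,  # assumed: bits 44:32
    }


def timer_scale(misc):
    """Extract the VMX-preemption timer scale (log2) from the misc word."""
    return misc & 0x1F                       # assumed: bits 4:0


if __name__ == "__main__":
    # Values copied from the vmxcap diffs quoted above.
    print(decode_basic(0xda040000000012))
    print(decode_basic(0xda040000000004))
    print(timer_scale(0x300481e5), timer_scale(0x7004c1e7))
    print(differing_bits(0x300481e5, 0x7004c1e7))
    print(differing_bits(0xf0106334141, 0xf0106734141))
```

For the miscellaneous data words, bit 1 falls inside the timer-scale field, so
it only reflects the 5 vs. 7 scale difference already visible in the listing.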