Dmitry,

Can you share the host kernel version?

I can not reproduce any of these crash signatures and I think it's
really a nested virtualization bug. So I will need the exact host kernel
version as well.

I am currently getting all sorts of:

"KVM: entry failed, hardware error 0x7"

... instead of the crash signatures that you are posting.

Regards.

On Sat, 2018-06-30 at 08:09 +0000, Raslan, KarimAllah wrote:
> Looking also at the other crash [0]:
>
>         msr_bitmap = to_vmx(vcpu)->loaded_vmcs->msr_bitmap;
> ffffffff811f65b7:       e8 44 cb 57 00          callq  ffffffff81773100 <__sanitizer_cov_trace_pc>
> ffffffff811f65bc:       48 8b 54 24 08          mov    0x8(%rsp),%rdx
> ffffffff811f65c1:       48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
> ffffffff811f65c8:       fc ff df
> ffffffff811f65cb:       48 c1 ea 03             shr    $0x3,%rdx
> ffffffff811f65cf:       80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)  <- fault here.
> ffffffff811f65d3:       0f 85 36 19 00 00       jne    ffffffff811f7f0f <vmx_vcpu_run+0x236f>
>
> %rdx should contain a pointer to loaded_vmcs. It is directly loaded
> from the stack [0x8(%rsp)]. This same stack location was just used
> before the inlined assembly for VMRESUME/VMLAUNCH here:
>
>         vmx->__launched = vmx->loaded_vmcs->launched;
> ffffffff811f639f:       e8 5c cd 57 00          callq  ffffffff81773100 <__sanitizer_cov_trace_pc>
> ffffffff811f63a4:       48 8b 54 24 08          mov    0x8(%rsp),%rdx
> ffffffff811f63a9:       48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
> ffffffff811f63b0:       fc ff df
> ffffffff811f63b3:       48 c1 ea 03             shr    $0x3,%rdx
> ffffffff811f63b7:       80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)  <- used here.
>
> ... and this stack location was never touched by anything in between!
> So something must have corrupted the stack itself not really the
> kvm_vcpu struct.
>
> Obviously the inlined assembly block is using the stack as well, but I
> can not see anything that would cause this corruption there.
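
For readers following the two listings: the movabs/shr/cmpb triple looks like the usual KASAN shadow-byte check emitted before a load. Below is a minimal Python sketch of that address computation, assuming the offset constant is the one visible in the movabs and that the shift is the 3 from the shr. The function and constant names are made up for illustration and are not taken from the kernel source.

```python
# Sketch of the shadow-address arithmetic seen in the disassembly.
# Assumptions: offset is the movabs immediate, shift is the "shr $0x3".
SHADOW_OFFSET = 0xdffffc0000000000
SHADOW_SHIFT = 3
MASK64 = (1 << 64) - 1


def kasan_shadow_addr(ptr: int) -> int:
    """Address that `cmpb $0x0,(%rdx,%rax,1)` would read for `ptr`."""
    return ((ptr >> SHADOW_SHIFT) + SHADOW_OFFSET) & MASK64


if __name__ == "__main__":
    # If the stack slot held 0, the cmpb reads from the bare offset,
    # which is not a canonical x86-64 address, so the access faults.
    print(hex(kasan_shadow_addr(0)))                   # 0xdffffc0000000000
    # A direct-map-looking pointer (the RSP value from the trace is used
    # here only as a sample input) maps into the shadow region instead.
    print(hex(kasan_shadow_addr(0xffff8801b7d7f380)))  # 0xffffed0036fafe70
```

So a fault on the cmpb itself suggests the value loaded from 0x8(%rsp) was bad, rather than the memory it was supposed to point to.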
>
> That being said, looking at the %rsp and %rbp values that are dumped
> in the stack trace:
>
> RSP: ffff8801b7d7f380
> RBP: ffff8801b8260140
>
> ... they are almost 4.9 MiB apart! Should not these two registers be a
> bit closer to each other? :)
>
> So 2 possibilities here:
>
> 1- %rsp is wrong
>
> That would explain why the loaded_vmcs was NULL. However, it is a bit
> harder to understand how it became wrong! It should have been restored
> during the VMEXIT from the HOST_RSP value in the VMCS!
>
> Is this a nested setup?
>
> 2- %rbp is wrong
>
> That would also explain why the loaded_vmcs was NULL. Whatever
> corrupted the stack that caused loaded_vmcs to be NULL could have also
> corrupted the %rbp saved in the stack. That would mean that it happened
> during a function call. All function calls that happened between the
> point when the stack was sane (just before the "asm" block for
> VMLAUNCH) and the crash-site are only kcov related. Looking at kcov, I
> can not see where the stack would get corrupted though! Obviously
> another source of corruption can be a completely unrelated thread
> directly corrupting this thread's memory.
>
> Maybe it would be easier to just try to repro it first and see which
> one is true (if at all).
>
> [0] https://syzkaller.appspot.com/bug?extid=cc483201a3c6436d3550
>
>
> On Thu, 2018-06-28 at 10:18 -0700, Jim Mattson wrote:
> >
> >   22:   0f 01 c3                vmresume
> >   25:   48 89 4c 24 08          mov    %rcx,0x8(%rsp)
> >   2a:   59                      pop    %rcx
> >
> > <rip>:
> >   2b:   0f 96 81 88 56 00 00    setbe  0x5688(%rcx)
> >   32:   48 89 81 00 03 00 00    mov    %rax,0x300(%rcx)
> >   39:   48 89 99 18 03 00 00    mov    %rbx,0x318(%rcx)
> >
> > %rcx should be pointing to the vcpu_vmx structure, but it's not even
> > canonical: 1ffff10035842e78.
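
The register arithmetic quoted above is easy to re-check. Here is a small Python sketch, assuming 48-bit virtual addresses (4-level paging) for the canonical test. The helper name is hypothetical, and the three values are copied from the RSP/RBP dump and from Jim's %rcx.

```python
# Re-check of the numbers quoted in the thread (values copied from it).
RSP = 0xffff8801b7d7f380
RBP = 0xffff8801b8260140
RCX = 0x1ffff10035842e78
MASK64 = (1 << 64) - 1


def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """True if bits 63..(va_bits-1) are all zero or all one.

    va_bits=48 is an assumption (4-level paging).
    """
    top = addr >> (va_bits - 1)
    return top == 0 or top == (1 << (64 - va_bits + 1)) - 1


if __name__ == "__main__":
    gap = RBP - RSP
    print(hex(gap), f"{gap / 2**20:.2f} MiB")      # 0x4e0dc0 4.88 MiB
    print(is_canonical(RSP), is_canonical(RBP))    # True True
    print(is_canonical(RCX))                       # False
    # Arithmetic observation only: shifting %rcx left by 3 happens to
    # give a canonical, direct-map-looking value.
    print(hex((RCX << 3) & MASK64))                # 0xffff8801ac2173c0
```

The last line is just arithmetic: the %rcx value has the shape of a kernel pointer that went through a "shr $0x3", like the ones in the shadow checks in the listings. Whether that is what actually happened is not established anywhere in this thread.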