On 17/09/2015 21:23, Andy Lutomirski wrote:
> On Thu, Sep 17, 2015 at 1:10 PM, Konrad Rzeszutek Wilk
> <konrad.wilk@xxxxxxxxxx> wrote:
>> On Wed, Sep 16, 2015 at 06:39:03PM -0400, Cole Robinson wrote:
>>> On 09/16/2015 05:08 PM, Konrad Rzeszutek Wilk wrote:
>>>> On Wed, Sep 16, 2015 at 05:04:31PM -0400, Cole Robinson wrote:
>>>>> On 09/16/2015 04:07 PM, M A Young wrote:
>>>>>> On Wed, 16 Sep 2015, Cole Robinson wrote:
>>>>>>
>>>>>>> Unfortunately I couldn't get anything else extra out of xen using any of these
>>>>>>> options or the ones Major recommended... in fact I couldn't get anything to
>>>>>>> the serial console at all. console=con1 would seem to redirect messages since
>>>>>>> they wouldn't show up on the graphical display, but nothing went to the serial
>>>>>>> log. Maybe I'm missing something...
>>>>>> That should be console=com1 so you have a typo either in this message or
>>>>>> in your tests.
>>>>>>
>>>>> Yeah that was it :/ So here's the crash output use -cpu host:
>>>>>
>>>>> - Cole
>>>>>
>>> <snip>
>>>
>>>>> about to get started...
>>>>> (XEN) traps.c:459:d0v0 Unhandled general protection fault fault/trap [#13] on
>>>>> VCPU 0 [ec=0000]
>>>>> (XEN) domain_crash_sync called from entry.S: fault at ffff82d08023a5d3
>>>>> create_bounce_frame+0x12b/0x13a
>>>>> (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
>>>>> (XEN) ----[ Xen-4.5.1 x86_64 debug=n Not tainted ]----
>>>>> (XEN) CPU: 0
>>>>> (XEN) RIP: e033:[<ffffffff810032b0>]
>>>> That is the Linux kernel EIP. Can you figure out what is at ffffffff810032b0 ?
>>>>
>>>> gdb vmlinux and then
>>>> x/20i 0xffffffff810032b0
>>>>
>>>> can help with that.
>>>>
>>> Updated to the latest kernel 4.1.6-201.fc22.x86_64. Trace is now:
>>>
>>> about to get started...
>>> (XEN) traps.c:459:d0v0 Unhandled general protection fault fault/trap [#13] on
>>> VCPU 0 [ec=0000]
> What exactly does this mean?

This means that there was a #GP fault originating from dom0 context, but
dom0 has not yet registered a #GP handler with Xen.
(I already have a patch pending to correct the wording of that error
message.) This would be a double fault on native.

>
>>> (XEN) domain_crash_sync called from entry.S: fault at ffff82d08023a5d3
>>> create_bounce_frame+0x12b/0x13a
>>> (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
>>> (XEN) ----[ Xen-4.5.1 x86_64 debug=n Not tainted ]----
>>> (XEN) CPU: 0
>>> (XEN) RIP: e033:[<ffffffff810031f0>]
>>> (XEN) RFLAGS: 0000000000000282 EM: 1 CONTEXT: pv guest
>>> (XEN) rax: 0000000000000015 rbx: ffffffff81c03e1c rcx: 00000000c0010112
>>> (XEN) rdx: 0000000000000001 rsi: ffffffff81c03e1c rdi: 00000000c0010112
>>> (XEN) rbp: ffffffff81c03df8 rsp: ffffffff81c03da0 r8: ffffffff81c03e28
>>> (XEN) r9: ffffffff81c03e2c r10: 0000000000000000 r11: 00000000ffffffff
>>> (XEN) r12: ffffffff81d25a60 r13: 0000000004000000 r14: 0000000000000000
>>> (XEN) r15: 0000000000000000 cr0: 0000000080050033 cr4: 00000000000406f0
>>> (XEN) cr3: 0000000075c0b000 cr2: 0000000000000000
>>> (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e02b cs: e033
>>> (XEN) Guest stack trace from rsp=ffffffff81c03da0:
>>> (XEN) 00000000c0010112 00000000ffffffff 0000000000000000 ffffffff810031f0
>>> (XEN) 000000010000e030 0000000000010082 ffffffff81c03de0 000000000000e02b
>>> (XEN) 0000000000000000 000000000000000c ffffffff81c03e1c ffffffff81c03e48
>>> (XEN) ffffffff8102a7a4 ffffffff81c03e48 ffffffff8102aa3b ffffffff81c03e48
>>> (XEN) cf1fa5f5e026f464 0000000001000000 ffffffff81c03ef8 0000000004000000
>>> (XEN) 0000000000000000 ffffffff81c03e58 ffffffff81d5d142 ffffffff81c03ee8
>>> (XEN) ffffffff81d58b56 0000000000000000 0000000000000000 ffffffff81c03e88
>>> (XEN) ffffffff810f8a39 ffffffff81c03ee8 ffffffff81798b13 ffffffff00000010
>>> (XEN) ffffffff81c03ef8 ffffffff81c03eb8 cf1fa5f5e026f464 ffffffff81f1de9c
>>> (XEN) ffffffffffffffff 0000000000000000 ffffffff81df7920 0000000000000000
>>> (XEN) 0000000000000000 ffffffff81c03f28 ffffffff81d51c74 cf1fa5f5e026f464
>>> (XEN) 0000000000000000 ffffffff81c03f60 ffffffff81c03f5c 0000000000000000
>>> (XEN) 0000000000000000 ffffffff81c03f38 ffffffff81d51339 ffffffff81c03ff8
>>> (XEN) ffffffff81d548b1 0000000000000000 00600f1200000000 0000000100000800
>>> (XEN) 0300000100000032 0000000000000005 0000000000000000 0000000000000000
>>> (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>> (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>> (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>> (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>> (XEN) 0f00000060c0c748 ccccccccccccc305 cccccccccccccccc cccccccccccccccc
>>> (XEN) Domain 0 crashed: rebooting machine in 5 seconds.
>>>
>>>
>>> gdb output:
>>>
>>> (gdb) x/20i 0xffffffff810031f0
>>> 0xffffffff810031f0 <xen_read_msr_safe+16>: rdmsr
>> Fantastic! So we have some rdmsr that makes KVM inject an
>> GP.
> What's the scenario? Is this Xen on KVM?

I believe from the thread that this is a Xen/dom0 combo running as a KVM
guest.

>
> Why didn't the guest print anything?

Lack of earlyprintk=xen on the dom0 command line. (IMO this really
should be the default when a PVOPs kernel detects that it is running
under Xen.)

>
> Is the issue here that the guest died due to failure to handle an
> RDMSR failure or did the *hypervisor* die?

The guest suffered a GP fault which it couldn't handle. Therefore Xen
crashed the domain. When dom0 crashes, Xen goes down too.

>
> It looks like null_trap_bounce is returning true, which suggests that
> the failure is happening before the guest sets up exception handling.

I concur.

>
>> Looking at the stack you have some other values:
>> ffffffff81c03de0, ffffffff81c03e1c .. they should correspond
>> to other functions calling this one. If you do 'nm --defined vmlinux | grep ffffffff81c03e1'
>> that should give an idea where they are. Or use 'gdb'.
>>
>> That will give us an stack - and we can find what type of MSR
>> this is.
>> Oh wait, it is on the registers: 00000000c0010112
>>
>> Ok, so where in the code is that MSR ah, that looks to be:
>> #define MSR_K8_TSEG_ADDR 0xc0010112
>>
>> which is called at bsp_init_amd.
>>
>> I think the problem here is that we are calling the
>> 'safe' variant of MSR but we still get an injected #GP and
>> don't expect that.
>>
>> I am not really sure what the expected outcome should be here.
>>
>> CC-ing xen-devel, KVM folks, and Andy who has been looking
>> in mucking around in the _safe* pvops.
> It's too early of a failure, I think.
>
> Cc: Borislav. Is TSEG guaranteed to exist? Can we defer that until
> we have exception handling working? Do we need to rig up exception
> handling so that it works earlier (e.g. in early_trap_init, which is
> presumably early enough)? Or is this just a KVM and/or Xen bug.

It would certainly help to move the exception setup as early as possible.

From a Xen PV guest's point of view, the kernel is already executing on
working pagetables and a flat GDT when it starts. A set_trap_table
hypercall (the equivalent of `lidt`) ought to be the second action,
following the stack switch. This appears not to be the case, and the
load_idt() is deferred until native cpu_init().

~Andrew
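
For illustration only, the ordering problem discussed in this message
can be sketched as a toy Python model. This is not Xen or Linux code:
every class and function name below (ToyHypervisor, read_msr_safe,
gp_fixup_handler) is hypothetical, and the -5 return value simply stands
in for an -EIO style error from a fixed-up read. The only things taken
from the thread are the MSR number, the #GP vector, and the idea that a
fault arriving before the guest has registered a trap table leaves the
hypervisor with nothing to bounce to.

```python
# Toy model only: not Xen or Linux code. All names here are hypothetical.

GP_VECTOR = 13                   # general protection fault, as in "[#13]"
MSR_K8_TSEG_ADDR = 0xC0010112    # the MSR found in rcx/rdi in the trace


class DomainCrashed(Exception):
    """Raised when a fault arrives and the guest has no handler registered."""


class ToyHypervisor:
    def __init__(self):
        # vector -> guest handler; empty when the guest starts running
        self.trap_table = {}

    def set_trap_table(self, table):
        # Stand-in for the set_trap_table hypercall (the PV analogue of lidt).
        self.trap_table = dict(table)

    def inject_fault(self, vector):
        handler = self.trap_table.get(vector)
        if handler is None:
            # Nothing to bounce the fault to, so the domain is crashed.
            raise DomainCrashed(f"unhandled trap #{vector}")
        return handler()


def gp_fixup_handler():
    # Stand-in for the fixup path that turns a faulting read into an error.
    return -5


def read_msr_safe(hv, msr, implemented_msrs):
    """Return (err, value); err is nonzero if the read faulted and was fixed up."""
    if msr in implemented_msrs:
        return 0, implemented_msrs[msr]
    # Unimplemented MSR: the read raises #GP in the guest.
    return hv.inject_fault(GP_VECTOR), 0


hv = ToyHypervisor()

# Case 1: the read happens before any trap table is registered.
try:
    read_msr_safe(hv, MSR_K8_TSEG_ADDR, {})
    early = "survived"
except DomainCrashed:
    early = "crashed"

# Case 2: the same read after the trap table has been registered.
hv.set_trap_table({GP_VECTOR: gp_fixup_handler})
late_err, late_val = read_msr_safe(hv, MSR_K8_TSEG_ADDR, {})
```

In this model the early read ends in a domain crash while the late read
only reports an error to the caller, which is why registering the trap
table as early as possible would let the 'safe' variant behave as its
callers expect.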