On 05/19/2017 09:14 AM, Roman Penyaev wrote:
Hi folks,

After experiencing guest double faults (sometimes triple faults) on 3.16
guest kernels with the following common pattern:

[459395.776124] PANIC: double fault, error_code: 0x0
[459395.776606] CPU: 0 PID: 36565 Comm: fio Not tainted 3.16.39kmemleak #4
[459395.776610] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083029-gandalf 04/01/2014
[459395.776614] task: ffff880009ca06b0 ti: ffff88003cbc2000 task.ti: ffff88003cbc2000
[459395.776617] RIP: 0010:[<ffffffff81048ecf>]  [<ffffffff81048ecf>] __do_page_fault+0x1f/0x540
[459395.776628] RSP: 002b:00007ffe0bc9bfa8  EFLAGS: 00010012
[459395.776631] RAX: 0000000081539927 RBX: 0000000000000000 RCX: ffffffff81539927
[459395.776634] RDX: 0000000000000028 RSI: 0000000000000000 RDI: 00007ffe0bc9c0a8
[459395.776637] RBP: 00007ffe0bc9c0a8 R08: 0001a1d1002e9400 R09: 0000000000063f1b
[459395.776640] R10: 0000000033f8356c R11: 000029c8250c3103 R12: 0000000000000028
[459395.776642] R13: 00007ff8c83e0000 R14: 0000000000000000 R15: 00007ffe0bc9c7c0
[459395.776649] FS:  00007ff8d2aaa7c0(0000) GS:ffff88003f400000(0000) knlGS:0000000000000000
[459395.776651] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[459395.776656] CR2: 00007ffe0bc9bf98 CR3: 000000003ca46000 CR4: 00000000000407f0
[459395.776658] Stack:
[459395.776661]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[459395.776666]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[459395.776670]  0000000000000000 0000000000000000 0000000000000000
[459395.776674] Call Trace:
[459395.776676]  <UNK>
[459395.776678] Code:
[459395.776680] ad 8c 4e 00 be 04 00 03 00 eb a8 90 66 66 66 66 90 41 57 41 56 41 55 41 54 49 89 d4 55 53 48 89 fd 48 89 f3 48 81 ec c8 00 00 00 <65> 48 8b 04 25 28 00 00 00 48 89 84 24 c0 00 00 00 31 c0 65 48
[459395.776716] Kernel panic - not syncing: Machine halted.
[459395.777172] CPU: 0 PID: 36565 Comm: fio Not tainted 3.16.39kmemleak #4
[459395.777673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083029-gandalf 04/01/2014
[459395.778373]  0000000000000086 00000000d85f6336 ffffffff81532ec9 ffffffff8170203e
[459395.779865]  ffff88003f402f18 ffffffff815318a1 00007ff800000008 ffff88003f402f28
[459395.780061]  ffff88003f402ec0 00000000d85f6336 00007fff00000008 0000000000000046
[459395.780061] Call Trace:
[459395.780061]  <#DF>  [<ffffffff81532ec9>] ? dump_stack+0x47/0x5a
[459395.780061]  [<ffffffff815318a1>] ? panic+0xcf/0x206
[459395.780061]  [<ffffffff8104234d>] ? df_debug+0x2d/0x30
[459395.780061]  [<ffffffff81004f78>] ? do_double_fault+0x78/0xf0
[459395.780061]  [<ffffffff8153a4f2>] ? double_fault+0x22/0x30
[459395.780061]  [<ffffffff81539927>] ? native_iret+0x7/0x7
[459395.780061]  [<ffffffff81048ecf>] ? __do_page_fault+0x1f/0x540
we found out that all kernel backtraces have a userspace RSP, where the
userspace memory holds a normal timer, page fault or virtio interrupt
trail (the following RSP pointer does not belong to the particular crash
above, but that does not matter; the symptoms are always the same):

crash> rd -s 00007f6cb9556768 100
    7f6cb9556768:  00007f6cfaa21270 00007f6cfaa21270
    7f6cb9556778:  0000000000000000 00007f6cf9b8c6a0
    7f6cb9556788:  00007f6cf983399a 0000000000000000
    7f6cb9556798:  0000000000000000 00007f6cf98a1f2d
    7f6cb95567a8:  00007f6cfaa21270 0000000000000000
    7f6cb95567b8:  00007f6ca4031880 ffffffffffffff7e   IRQ, ~0xffffffffffffff7e = 0x81
    7f6cb95567c8:  00007f6cfa817ae1 0000000000000033   RIP; CS
    7f6cb95567d8:  0000000000000202 00007f6cb95567f0   EFLAGS; RSP
    7f6cb95567e8:  000000000000002b 00007f6cb00318e0   SS
    7f6cb95567f8:  00007f6cfa817af5 00007f6cac0318e0
    7f6cb9556808:  00007f6cfa817af5 00007f6cb4031880
    7f6cb9556818:  00007f6cfa817af5 00007f6cfbe82340
    7f6cb9556828:  00007f6cfa817af5 00007f6c940318c0
    7f6cb9556838:  00007f6cfa817af5 00007f6ca00318c0
    7f6cb9556848:  00007f6cfa817af5 00007f6c9c031920
    7f6cb9556858:  00007f6cfa817af5 00007f6ca8031920
    7f6cb9556868:  00007f6cfa817af5 00007f6ca4039df0
    7f6cb9556878:  00007f6cfa817af5 00007f6cb0039e50
    7f6cb9556888:  00007f6cfa817af5 00007f6cac039e50

It turned out that the CPU did not change SS:RSP and took the interrupt
on the userspace stack (BTW, init_tss and gdt_page are not corrupted).
That is completely weird.

The next step was to trace the VMCB before and after VMRUN to understand
the exact state seen by the real CPU. The VMCB was captured when the
wrong CPU state was observed: RIP points into the kernel while RSP is
from userspace. The following is a diff of the VMCB, where --- is the
state before VMRUN and +++ is the state after VMRUN:

-   event_inj = 0x80000081,
+   event_inj = 0x0,
    ...
    cs = {
-     selector = 0x33,
-     attrib = 0xafb,
+     selector = 0x10,
+     attrib = 0x29b,
      limit = 0xffffffff,
      base = 0x0
    },
    ss = {
      selector = 0x2b,
      attrib = 0x0,
      limit = 0xffffffff,
      base = 0x0
    },
    ...
    cpl = 0x0,                 <<<<<< WTF?
    ...
-   rip = 0x7f6cfa817ae1,
+   rip = 0xffffffff8152b690,
    ...
-   rsp = 0x7f6cb95567f0,
+   rsp = 0x7f6cb9556768,

The execution scenario is the following:

1. Userspace code was suspended.
2. The virtio interrupt was injected (event_inj = 0x80000081, i.e.
   vector 0x81 with the valid bit set) and VMRUN was called again.
3. The hardware CPU took the virtual interrupt but did not switch the
   stack, because CPL *was* already set to 0.
4. KABOOOOOM.

So the stack switch does not happen because "the processor performs an
automatic stack switch when a control transfer causes a change in
privilege levels to occur" (AMD manual), and in this case the privilege
level was not changed.
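To make that concrete, here is a hypothetical, self-contained C model of
the stack-switch decision (all names invented for illustration; this is
not hardware, KVM, or kernel code):

    #include <stdio.h>

    /* Toy model of how the CPU decides whether to switch stacks when
     * delivering an interrupt through a ring-0 interrupt gate. */
    static void deliver_interrupt(unsigned int current_cpl)
    {
            unsigned int target_cpl = 0;    /* handler runs in ring 0 */

            if (target_cpl < current_cpl)
                    printf("CPL %u -> %u: load SS:RSP from the TSS\n",
                           current_cpl, target_cpl);
            else
                    printf("CPL %u -> %u: no privilege change, frame is "
                           "pushed on the CURRENT stack\n",
                           current_cpl, target_cpl);
    }

    int main(void)
    {
            deliver_interrupt(3);   /* sane case: kernel stack is used */
            deliver_interrupt(0);   /* our case: bogus CPL=0, so the
                                       frame lands on the userspace RSP */
            return 0;
    }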
The question remains: who changed CPL to 0, and why?

Further tracing of VMCB states before and after VMRUN showed that CPL
becomes 0 when a VMEXIT happens with the following SS segment:

    ss = {
      selector = 0x2b,
      attrib = 0x400,
      limit = 0xffffffff,
      base = 0x0
    },
    cpl = 0x3

Then on the next VMRUN the VMCB looks like the following:

    ss = {
      selector = 0x2b,
      attrib = 0x0,            <<< dropped to 0
      limit = 0xffffffff,
      base = 0x0
    },
    cpl = 0x0,                 <<< dropped to 0

Obviously it was changed between the VMRUN calls. The following
backtrace shows that VMCB.SAVE.CPL was set to 0 by QEMU itself:

    CPU: 55 PID: 59531 Comm: kvm
    [<ffffffffa00a3a20>] kvm_arch_vcpu_ioctl_set_sregs+0x2e0/0x480 [kvm]
    [<ffffffffa008ddf0>] kvm_write_guest_cached+0x540/0xc00 [kvm]
    [<ffffffff8107d695>] ? finish_task_switch+0x185/0x240
    [<ffffffff8180097c>] ? __schedule+0x28c/0xa10
    [<ffffffff811a9aad>] do_vfs_ioctl+0x2cd/0x4a0

The SS segment which came from QEMU had the following struct members:

    SS->base     = 0
    SS->limit    = ffffffff
    SS->selector = 2b
    SS->type     = 0
    SS->present  = 0
    SS->dpl      = 0
    SS->db       = 0
    SS->s        = 0
    SS->l        = 0
    SS->g        = 0
    SS->avl      = 0
    SS->unusable = 1

Indeed, on the last VMEXIT the SS segment does not have the present (P)
bit set in its segment attributes:

    (gdb) p 0x400 & (1 << SVM_SELECTOR_P_SHIFT)
    $1 = 0
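For reference, the VMCB attrib field packs the descriptor access bits
(type in bits 0-3, S in bit 4, DPL in bits 5-6, P in bit 7, AVL in bit
8, L in bit 9, D/B in bit 10, G in bit 11; the SVM_SELECTOR_*_SHIFT
constants in the kernel's svm.h follow this layout). A tiny standalone
decoder shows why 0x400 means "not present":

    #include <stdio.h>

    int main(void)
    {
            unsigned int attrib = 0x400;   /* SS attrib seen on VMEXIT */

            printf("type=%u S=%u DPL=%u P=%u AVL=%u L=%u DB=%u G=%u\n",
                   attrib & 0xf,
                   (attrib >> 4) & 1,
                   (attrib >> 5) & 3,
                   (attrib >> 7) & 1,    /* P == 0: SS is "unusable" */
                   (attrib >> 8) & 1,
                   (attrib >> 9) & 1,
                   (attrib >> 10) & 1,   /* D/B is the only bit set  */
                   (attrib >> 11) & 1);
            return 0;
    }

which prints: type=0 S=0 DPL=0 P=0 AVL=0 L=0 DB=1 G=0.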
Huh? How is that even possible? It should not be possible to actually run the vCPU with a non-NULL SS that isn't present. How would you cause it to happen?
Unless... is this the sysret_ss_attrs issue? This is a 3.16.something guest, and maybe the sysret_ss_attrs code never got backported. I bet that the user code running when the virtio interrupt hits is in the vDSO.
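For reference, the upstream workaround (the X86_BUG_SYSRET_SS_ATTRS
handling that landed around v4.1) boils down to reloading SS with a
known-good selector on context switch, so that the stale cached
attributes left behind by SYSRET never survive into an interrupt.
Approximately, from memory (see __switch_to() in
arch/x86/kernel/process_64.c for the exact form):

    /* On affected AMD CPUs, SYSRET restores the SS selector but leaves
     * the cached descriptor attributes stale (P clear), so reload SS
     * with a sane selector on every context switch. */
    if (unlikely(static_cpu_has_bug(X86_BUG_SYSRET_SS_ATTRS))) {
            unsigned short ss_sel;

            savesegment(ss, ss_sel);
            if (ss_sel != __KERNEL_DS)
                    loadsegment(ss, __KERNEL_DS);
    }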
So when on VMEXIT we have such an SS state (the P bit is not set) and
QEMU just decides to synchronize registers, the following happens on the
QEMU side:

kvm_cpu_synchronize_state():
  kvm_arch_get_registers():
      ...
      get_seg():
          if (rhs->unusable) {
              lhs->flags = 0;    <<< SS is unusable ((P) is not set);
          }                      <<< all attributes are dropped to 0

  cpu->kvm_vcpu_dirty = true;    <<< mark VCPU state as dirty
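To spell out the write-back half of the round trip: once kvm_vcpu_dirty
is set, kvm_arch_put_registers() pushes the zeroed flags back through
KVM_SET_SREGS, the DPL reconstructed from flags == 0 is 0, and (if
memory serves) SVM derives the guest CPL from the SS DPL it is handed.
A hypothetical, self-contained condensation, not verbatim QEMU/KVM code
(the DESC_* constants are copied from QEMU's target-i386/cpu.h):

    #include <stdio.h>

    #define DESC_TYPE_SHIFT 8
    #define DESC_DPL_SHIFT  13
    #define DESC_P_MASK     (1 << 15)

    int main(void)
    {
            unsigned int flags = 0;  /* get_seg() zeroed them */

            /* QEMU side, set_seg(): kvm_segment rebuilt from flags */
            unsigned int type    = (flags >> DESC_TYPE_SHIFT) & 15;  /* 0 */
            unsigned int present = (flags & DESC_P_MASK) != 0;       /* 0 */
            unsigned int dpl     = (flags >> DESC_DPL_SHIFT) & 3;    /* 0, was 3 */

            /* KVM side: for SS, the guest CPL follows the handed-in DPL */
            unsigned int cpl = dpl & 3;

            printf("type=%u present=%u dpl=%u -> vmcb.save.cpl=%u\n",
                   type, present, dpl, cpl);
            return 0;
    }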
Looks like the bug is in QEMU, then, right? Couldn't you just fix this code in QEMU by, say, deleting it? If it's actually needed for some reason, then at least sanitize the flags correctly rather than corrupting the DPL. (I'm guessing the real issue here is that migration from AMD to Intel fails without this, but migration from AMD to Intel is highly dubious regardless.)
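If the code does have to stay for the cross-vendor migration case, one
possible shape of the "sanitize correctly" variant, as an untested
sketch only (mirroring the flag assembly get_seg() already does for
usable segments, with only the unusable handling changed), would be to
keep the attributes and just clear P:

    static void get_seg(SegmentCache *lhs, const struct kvm_segment *rhs)
    {
        lhs->selector = rhs->selector;
        lhs->base = rhs->base;
        lhs->limit = rhs->limit;
        lhs->flags = (rhs->type << DESC_TYPE_SHIFT) |
                     (rhs->present * DESC_P_MASK) |
                     (rhs->dpl << DESC_DPL_SHIFT) |
                     (rhs->db << DESC_B_SHIFT) |
                     (rhs->s * DESC_S_MASK) |
                     (rhs->l << DESC_L_SHIFT) |
                     (rhs->g * DESC_G_MASK) |
                     (rhs->avl * DESC_AVL_MASK);
        if (rhs->unusable) {
            /* unusable -> not present, but DPL and the rest survive */
            lhs->flags &= ~DESC_P_MASK;
        }
    }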
--Andy