On 05/19/2017 09:14 AM, Roman Penyaev wrote:
Hi folks,

After experiencing guest double faults (sometimes triple faults) on 3.16
guest kernels with the following common pattern:

[459395.776124] PANIC: double fault, error_code: 0x0
[459395.776606] CPU: 0 PID: 36565 Comm: fio Not tainted 3.16.39kmemleak #4
[459395.776610] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083029-gandalf 04/01/2014
[459395.776614] task: ffff880009ca06b0 ti: ffff88003cbc2000 task.ti: ffff88003cbc2000
[459395.776617] RIP: 0010:[<ffffffff81048ecf>]  [<ffffffff81048ecf>] __do_page_fault+0x1f/0x540
[459395.776628] RSP: 002b:00007ffe0bc9bfa8  EFLAGS: 00010012
[459395.776631] RAX: 0000000081539927 RBX: 0000000000000000 RCX: ffffffff81539927
[459395.776634] RDX: 0000000000000028 RSI: 0000000000000000 RDI: 00007ffe0bc9c0a8
[459395.776637] RBP: 00007ffe0bc9c0a8 R08: 0001a1d1002e9400 R09: 0000000000063f1b
[459395.776640] R10: 0000000033f8356c R11: 000029c8250c3103 R12: 0000000000000028
[459395.776642] R13: 00007ff8c83e0000 R14: 0000000000000000 R15: 00007ffe0bc9c7c0
[459395.776649] FS:  00007ff8d2aaa7c0(0000) GS:ffff88003f400000(0000) knlGS:0000000000000000
[459395.776651] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[459395.776656] CR2: 00007ffe0bc9bf98 CR3: 000000003ca46000 CR4: 00000000000407f0
[459395.776658] Stack:
[459395.776661]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[459395.776666]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[459395.776670]  0000000000000000 0000000000000000 0000000000000000
[459395.776674] Call Trace:
[459395.776676]  <UNK>
[459395.776678] Code:
[459395.776680] ad 8c 4e 00 be 04 00 03 00 eb a8 90 66 66 66 66 90 41 57 41 56 41 55 41 54 49 89 d4 55 53 48 89 fd 48 89 f3 48 81 ec c8 00 00 00 <65> 48 8b 04 25 28 00 00 00 48 89 84 24 c0 00 00 00 31 c0 65 48
[459395.776716] Kernel panic - not syncing: Machine halted.
[459395.777172] CPU: 0 PID: 36565 Comm: fio Not tainted 3.16.39kmemleak #4
[459395.777673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083029-gandalf 04/01/2014
[459395.778373]  0000000000000086 00000000d85f6336 ffffffff81532ec9 ffffffff8170203e
[459395.779865]  ffff88003f402f18 ffffffff815318a1 00007ff800000008 ffff88003f402f28
[459395.780061]  ffff88003f402ec0 00000000d85f6336 00007fff00000008 0000000000000046
[459395.780061] Call Trace:
[459395.780061]  <#DF>  [<ffffffff81532ec9>] ? dump_stack+0x47/0x5a
[459395.780061]  [<ffffffff815318a1>] ? panic+0xcf/0x206
[459395.780061]  [<ffffffff8104234d>] ? df_debug+0x2d/0x30
[459395.780061]  [<ffffffff81004f78>] ? do_double_fault+0x78/0xf0
[459395.780061]  [<ffffffff8153a4f2>] ? double_fault+0x22/0x30
[459395.780061]  [<ffffffff81539927>] ? native_iret+0x7/0x7
[459395.780061]  [<ffffffff81048ecf>] ? __do_page_fault+0x1f/0x540
we found out that all kernel backtraces have a userspace RSP, where the
userspace memory holds a normal timer, page fault or virtio interrupt
trail (the following RSP pointer does not belong to the particular crash
above, but that does not matter; the symptoms are always the same):

crash> rd -s 00007f6cb9556768 100
    7f6cb9556768:  00007f6cfaa21270 00007f6cfaa21270
    7f6cb9556778:  0000000000000000 00007f6cf9b8c6a0
    7f6cb9556788:  00007f6cf983399a 0000000000000000
    7f6cb9556798:  0000000000000000 00007f6cf98a1f2d
    7f6cb95567a8:  00007f6cfaa21270 0000000000000000
    7f6cb95567b8:  00007f6ca4031880 ffffffffffffff7e   IRQ, ~0xffffffffffffff7e = 0x81
    7f6cb95567c8:  00007f6cfa817ae1 0000000000000033   RIP; CS
    7f6cb95567d8:  0000000000000202 00007f6cb95567f0   EFLAGS; RSP
    7f6cb95567e8:  000000000000002b 00007f6cb00318e0   SS
    7f6cb95567f8:  00007f6cfa817af5 00007f6cac0318e0
    7f6cb9556808:  00007f6cfa817af5 00007f6cb4031880
    7f6cb9556818:  00007f6cfa817af5 00007f6cfbe82340
    7f6cb9556828:  00007f6cfa817af5 00007f6c940318c0
    7f6cb9556838:  00007f6cfa817af5 00007f6ca00318c0
    7f6cb9556848:  00007f6cfa817af5 00007f6c9c031920
    7f6cb9556858:  00007f6cfa817af5 00007f6ca8031920
    7f6cb9556868:  00007f6cfa817af5 00007f6ca4039df0
    7f6cb9556878:  00007f6cfa817af5 00007f6cb0039e50
    7f6cb9556888:  00007f6cfa817af5 00007f6cac039e50

It turned out that the CPU did not change SS:RSP and took the interrupt
on the userspace stack (BTW, init_tss and gdt_page are not corrupted).
That is completely weird.

The next step was to trace the VMCB before and after VMRUN to understand
the exact state seen by the real CPU. The VMCB was captured when the
wrong CPU state was observed: RIP points into the kernel while RSP is
from userspace. The following is a diff of the VMCB, where --- is the
state before VMRUN and +++ is the state after VMRUN:

-   event_inj = 0x80000081,
+   event_inj = 0x0,
    ...
    cs = {
-     selector = 0x33,
-     attrib = 0xafb,
+     selector = 0x10,
+     attrib = 0x29b,
      limit = 0xffffffff,
      base = 0x0
    },
    ss = {
      selector = 0x2b,
      attrib = 0x0,
      limit = 0xffffffff,
      base = 0x0
    },
    ...
    cpl = 0x0,                 <<<<<< WTF?
    ...
-   rip = 0x7f6cfa817ae1,
+   rip = 0xffffffff8152b690,
    ...
-   rsp = 0x7f6cb95567f0,
+   rsp = 0x7f6cb9556768,

The execution scenario is the following:

1. Userspace code was suspended.
2. The virtio interrupt was injected (event_inj = 0x80000081, i.e.
   vector 0x81 with the valid bit set) and VMRUN was called again.
3. The hardware CPU took the virtual interrupt but did not switch the
   stack, because CPL *was* already set to 0.
4. KABOOOOOM.

So the stack switch does not happen because "the processor performs an
automatic stack switch when a control transfer causes a change in
privilege levels to occur" (AMD manual), and in this case the privilege
level was not changed.
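To make that concrete, here is a hypothetical, self-contained C model of
the stack-switch decision (all names invented for illustration; this is
not hardware, KVM, or kernel code):

    #include <stdio.h>

    /* Toy model of how the CPU decides whether to switch stacks when
     * delivering an interrupt through a ring-0 interrupt gate. */
    static void deliver_interrupt(unsigned int current_cpl)
    {
            unsigned int target_cpl = 0;    /* handler runs in ring 0 */

            if (target_cpl < current_cpl)
                    printf("CPL %u -> %u: load SS:RSP from the TSS\n",
                           current_cpl, target_cpl);
            else
                    printf("CPL %u -> %u: no privilege change, frame is "
                           "pushed on the CURRENT stack\n",
                           current_cpl, target_cpl);
    }

    int main(void)
    {
            deliver_interrupt(3);   /* sane case: kernel stack is used */
            deliver_interrupt(0);   /* our case: bogus CPL=0, so the
                                       frame lands on the userspace RSP */
            return 0;
    }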
The question remains: who changed CPL to 0, and why?

Further tracing of VMCB states before and after VMRUN showed that CPL
becomes 0 when a VMEXIT happens with the following SS segment:

    ss = {
      selector = 0x2b,
      attrib = 0x400,
      limit = 0xffffffff,
      base = 0x0
    },
    cpl = 0x3

Then on the next VMRUN the VMCB looks like the following:

    ss = {
      selector = 0x2b,
      attrib = 0x0,            <<< dropped to 0
      limit = 0xffffffff,
      base = 0x0
    },
    cpl = 0x0,                 <<< dropped to 0

Obviously it was changed between the VMRUN calls. The following
backtrace shows that VMCB.SAVE.CPL was set to 0 by QEMU itself:

    CPU: 55 PID: 59531 Comm: kvm
    [<ffffffffa00a3a20>] kvm_arch_vcpu_ioctl_set_sregs+0x2e0/0x480 [kvm]
    [<ffffffffa008ddf0>] kvm_write_guest_cached+0x540/0xc00 [kvm]
    [<ffffffff8107d695>] ? finish_task_switch+0x185/0x240
    [<ffffffff8180097c>] ? __schedule+0x28c/0xa10
    [<ffffffff811a9aad>] do_vfs_ioctl+0x2cd/0x4a0

The SS segment which came from QEMU had the following struct members:

    SS->base     = 0
    SS->limit    = ffffffff
    SS->selector = 2b
    SS->type     = 0
    SS->present  = 0
    SS->dpl      = 0
    SS->db       = 0
    SS->s        = 0
    SS->l        = 0
    SS->g        = 0
    SS->avl      = 0
    SS->unusable = 1

Indeed, on the last VMEXIT the SS segment does not have the present (P)
bit set in its segment attributes:

    (gdb) p 0x400 & (1 << SVM_SELECTOR_P_SHIFT)
    $1 = 0
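For reference, the VMCB attrib field packs the descriptor access bits
(type in bits 0-3, S in bit 4, DPL in bits 5-6, P in bit 7, AVL in bit
8, L in bit 9, D/B in bit 10, G in bit 11; the SVM_SELECTOR_*_SHIFT
constants in the kernel's svm.h follow this layout). A tiny standalone
decoder shows why 0x400 means "not present":

    #include <stdio.h>

    int main(void)
    {
            unsigned int attrib = 0x400;   /* SS attrib seen on VMEXIT */

            printf("type=%u S=%u DPL=%u P=%u AVL=%u L=%u DB=%u G=%u\n",
                   attrib & 0xf,
                   (attrib >> 4) & 1,
                   (attrib >> 5) & 3,
                   (attrib >> 7) & 1,    /* P == 0: SS is "unusable" */
                   (attrib >> 8) & 1,
                   (attrib >> 9) & 1,
                   (attrib >> 10) & 1,   /* D/B is the only bit set  */
                   (attrib >> 11) & 1);
            return 0;
    }

which prints: type=0 S=0 DPL=0 P=0 AVL=0 L=0 DB=1 G=0.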
Huh? How is that even possible? It should not be possible to actually run the vCPU with a non-NULL SS that isn't present. How would you cause it to happen?
Unless... is this the sysret_ss_attrs issue? This is a 3.16.something guest, and maybe the sysret_ss_attrs code never got backported. I bet that the user code running when the virtio interrupt hits is in the vDSO.
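For reference, the upstream workaround (the X86_BUG_SYSRET_SS_ATTRS
handling that landed around v4.1) boils down to reloading SS with a
known-good selector on context switch, so that the stale cached
attributes left behind by SYSRET never survive into an interrupt.
Approximately, from memory (see __switch_to() in
arch/x86/kernel/process_64.c for the exact form):

    /* On affected AMD CPUs, SYSRET restores the SS selector but leaves
     * the cached descriptor attributes stale (P clear), so reload SS
     * with a sane selector on every context switch. */
    if (unlikely(static_cpu_has_bug(X86_BUG_SYSRET_SS_ATTRS))) {
            unsigned short ss_sel;

            savesegment(ss, ss_sel);
            if (ss_sel != __KERNEL_DS)
                    loadsegment(ss, __KERNEL_DS);
    }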
So when on VMEXIT we have such an SS state (the P bit is not set) and
QEMU just decides to synchronize registers, the following happens on the
QEMU side:

kvm_cpu_synchronize_state():
  kvm_arch_get_registers():
      ...
      get_seg():
          if (rhs->unusable) {
              lhs->flags = 0;    <<< SS is unusable ((P) is not set);
          }                      <<< all attributes are dropped to 0

  cpu->kvm_vcpu_dirty = true;    <<< mark VCPU state as dirty
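To spell out the write-back half of the round trip: once kvm_vcpu_dirty
is set, kvm_arch_put_registers() pushes the zeroed flags back through
KVM_SET_SREGS, the DPL reconstructed from flags == 0 is 0, and (if
memory serves) SVM derives the guest CPL from the SS DPL it is handed.
A hypothetical, self-contained condensation, not verbatim QEMU/KVM code
(the DESC_* constants are copied from QEMU's target-i386/cpu.h):

    #include <stdio.h>

    #define DESC_TYPE_SHIFT 8
    #define DESC_DPL_SHIFT  13
    #define DESC_P_MASK     (1 << 15)

    int main(void)
    {
            unsigned int flags = 0;  /* get_seg() zeroed them */

            /* QEMU side, set_seg(): kvm_segment rebuilt from flags */
            unsigned int type    = (flags >> DESC_TYPE_SHIFT) & 15;  /* 0 */
            unsigned int present = (flags & DESC_P_MASK) != 0;       /* 0 */
            unsigned int dpl     = (flags >> DESC_DPL_SHIFT) & 3;    /* 0, was 3 */

            /* KVM side: for SS, the guest CPL follows the handed-in DPL */
            unsigned int cpl = dpl & 3;

            printf("type=%u present=%u dpl=%u -> vmcb.save.cpl=%u\n",
                   type, present, dpl, cpl);
            return 0;
    }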
Looks like the bug is in QEMU, then, right? Couldn't you just fix this code in QEMU by, say, deleting it? If it's actually needed for some reason, then at least sanitize the flags correctly rather than corrupting the DPL. (I'm guessing the real issue here is that migration from AMD to Intel fails without this, but migration from AMD to Intel is highly dubious regardless.)
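If the code does have to stay for the cross-vendor migration case, one
possible shape of the "sanitize correctly" variant, as an untested
sketch only (mirroring the flag assembly get_seg() already does for
usable segments, with only the unusable handling changed), would be to
keep the attributes and just clear P:

    static void get_seg(SegmentCache *lhs, const struct kvm_segment *rhs)
    {
        lhs->selector = rhs->selector;
        lhs->base = rhs->base;
        lhs->limit = rhs->limit;
        lhs->flags = (rhs->type << DESC_TYPE_SHIFT) |
                     (rhs->present * DESC_P_MASK) |
                     (rhs->dpl << DESC_DPL_SHIFT) |
                     (rhs->db << DESC_B_SHIFT) |
                     (rhs->s * DESC_S_MASK) |
                     (rhs->l << DESC_L_SHIFT) |
                     (rhs->g * DESC_G_MASK) |
                     (rhs->avl * DESC_AVL_MASK);
        if (rhs->unusable) {
            /* unusable -> not present, but DPL and the rest survive */
            lhs->flags &= ~DESC_P_MASK;
        }
    }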
--Andy