Re: [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present

Roman Penyaev <roman.penyaev@xxxxxxxxxxxxxxxx> · Sun, 21 May 2017 09:53:58 +0200

On Sun, May 21, 2017 at 5:31 AM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> On 05/19/2017 09:14 AM, Roman Penyaev wrote:
>>
>> Hi folks,
>>
>> After experiencing guest double faults (sometimes triple faults) on
>> 3.16 guest kernels with the following common pattern:
>>

[cut]

>>
>> Further tracking of VMCB states before and after VMRUN showed, that
>> CPL becomes 0 when VMEXIT happens with the following SS segment:
>>
>>            ss = {
>>              selector = 0x2b,
>>              attrib = 0x400,
>>              limit = 0xffffffff,
>>              base = 0x0
>>            },
>>
>>            cpl = 0x3
>>
>> Then on next VMRUN VMCB looks as the following:
>>
>>            ss = {
>>              selector = 0x2b,
>>              attrib = 0x0,            <<< dropped to 0
>>              limit = 0xffffffff,
>>              base = 0x0
>>            },
>>
>>            cpl = 0x0,                 <<< dropped to 0
>>
>> Obviously it was changed between VMRUN calls.  The following backtrace
>> shows that VMCB.SAVE.CPL was set to 0 by QEMU itself:
>>
>>    CPU: 55 PID: 59531 Comm: kvm
>>    [<ffffffffa00a3a20>] kvm_arch_vcpu_ioctl_set_sregs+0x2e0/0x480 [kvm]
>>    [<ffffffffa008ddf0>] kvm_write_guest_cached+0x540/0xc00 [kvm]
>>    [<ffffffff8107d695>] ? finish_task_switch+0x185/0x240
>>    [<ffffffff8180097c>] ? __schedule+0x28c/0xa10
>>    [<ffffffff811a9aad>] do_vfs_ioctl+0x2cd/0x4a0
>>
>> SS segment which came from QEMU had the following struct members:
>>
>>         SS->base      = 0
>>         SS->limit     = ffffffff
>>         SS->selector  = 2b
>>         SS->type      = 0
>>         SS->present   = 0
>>         SS->dpl       = 0
>>         SS->db        = 0
>>         SS->s         = 0
>>         SS->l         = 0
>>         SS->g         = 0
>>         SS->avl       = 0
>>         SS->unusable  = 1
>>
>> Indeed, on last VMEXIT SS segment does not have (P) present bit set in
>> segment attributes:
>>
>>     (gdb) p 0x400 & (1 << SVM_SELECTOR_P_SHIFT)
>>     $1 = 0
>
>
> Huh?  How is that even possible?  It should not be possible to actually run
> the vCPU with a non-NULL SS that isn't present.

That is utterly good question :)  I do not know.  According to my shallow
understanding (P) bit is only a hint for CPU that corresponding segment was
read from gdt and now is cached in private CPU registers (attributes).
Am I right?

At least what I see that it is quite often the case when we exit from VMRUN
with segment not present then VMRUN is resumed and on next vmexit segment has
correct attributes.

> How would you cause it to happen?

We run fio and iperf tests in guests for a couple of days.  Nothing more,
nothing special.  Guests are bare debians with 3.16 kernels.

>
> Unless... is this the sysret_ss_attrs issue?

What is the issue?  This one

https://lkml.org/lkml/2015/4/24/770

??

We tried to disassemble interrupted userspace code to find any visible
pattern - no syscalls around.

> This is a 3.16.something
> guest, and maybe the sysret_ss_attrs code never got backported.
> I bet that the user code running when the virtio interrupt hits is in
> the vDSO.

No, user code is interrupted in random places, e.g. the following is the
print from a driver which intercepts kvm_vcpu_run() and catches wrong
CPL when user code was interrupted:

    @@@@@ [63208] WTF? RIP=83f2a0, RSP=7fffc630e0c0, CPL=0, fix CPL=3
    @@@@@ [63205] WTF? RIP=9ae258, RSP=7ffe49f6a260, CPL=0, fix CPL=3
    @@@@@ [63215] WTF? RIP=dadee1, RSP=7ffc9ee1b508, CPL=0, fix CPL=3
    @@@@@ [63217] WTF? RIP=83f89c, RSP=7ffd6a1287e0, CPL=0, fix CPL=3
    @@@@@ [21521] WTF? RIP=7fa85b68e1e7, RSP=7ffd6d408ab0, CPL=0, fix CPL=3

As you can see sometimes RIP belongs to lower addresses, but vDSO mappings
are higher and start from 7ff.

>>
>> So when on VMEXIT we have such SS state (P bit is not set) and QEMU
>> just decides to synchronize registers the following happens on QEMU
>> side:
>>
>>     kvm_cpu_synchronize_state():
>>         kvm_arch_get_registers():
>>             ...
>>             get_seg():
>>                if (rhs->unusable) {
>>                    lhs->flags = 0;    <<< SS is unusable [(P) is not set)
>>                                       <<< all attributes are dropped to 0.
>>                 }
>>         cpu->kvm_vcpu_dirty = true;   <<< Mark VCPU state as dirty
>>
>
> Looks like the bug is in QEMU, then, right?

KVM SVM restores CPL from unusable selector, obviously this is not nice.

arch/x86/kvm/svm.c:svm_set_segment():

   if (var->unusable)
      s->attrib = 0;
   ...
   if (seg == VCPU_SREG_SS)
       svm->vmcb->save.cpl = (s->attrib >> SVM_SELECTOR_DPL_SHIFT) & 3;

Meanwhile QEMU resets attributes, despite the fact that DPL (which is passed
from KVM) is correct.

So it is not clear what is the proper way to fix that.   What is clear is
that CPL is set to 0 because of this game with registers on both sides.
Now the question is what side to fix or probably both.

> Couldn't you just fix this code
> in QEMU by, say, deleting it?

Certainly, but would be nice to listen to KVM maintainers.  At least the issue
is clear and what is left is a proper one-line fix :)

--
Roman

> If it's actually needed for some reason, then
> at least sanitize the flags correctly rather than corrupting the DPL.  (I'm
> guessing the real issue here is that migration from AMD to Intel fails
> without this, but migration from AMD to Intel is highly dubious regardless.)
>
> --Andy