Re: Yet another KPTI regression with 4.14.x series in a VM

Andy Lutomirski <luto@xxxxxxxxxx> · Fri, 12 Jan 2018 22:08:20 -0800

On Fri, Jan 12, 2018 at 1:51 PM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> On Fri, 12 Jan 2018, Laura Abbott wrote:
>
> Cc+ Andy
>
> I'm almost crashed out by now. Andy might have an idea. I'll look again
> tomorrow with brain awake.
>
>> On 01/12/2018 10:51 AM, Thomas Gleixner wrote:
>> > On Fri, 12 Jan 2018, Laura Abbott wrote:
>> > > Fedora got a bug report on 4.14.11 of a panic when booting a
>> > > Fedora guest in a CentOS 6 VM, not reproducible with nopti.
>> > > The issue is still present as of 4.14.13 as well. The only
>> > > report is a panic screenshot
>> > > https://bugzilla.redhat.com/show_bug.cgi?id=1532458
>> > >
>> > > I've lost track of all the fixes that have been flying around,
>> > > is this a new issue or has a fix not yet made it to stable?
>> >
>> > Hmm. Looks kinda familiar, but that has been fixed I think even before
>> > 4.4.11. Could you please ask the reported to provide a full console log via
>> > the VM "serial console" ?
>> >
>> > Thanks,
>> >
>> >     tglx
>> >
>>

>> [    2.910025] PANIC: double fault, error_code: 0x0

Probably a stack overflow

>> [    2.910025] CPU: 1 PID: 56 Comm: modprobe Not tainted
>> 4.14.13-300.fc27.x86_64 #1
>> [    2.910025] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
>> [    2.910025] task: ffff891f78dc3c00 task.stack: ffffa6eac0594000
>> [    2.910025] RIP: 0010:vprintk_default+0x5/0x30
>> [    2.910025] RSP: 0000:fffffe000002e000 EFLAGS: 00010046
>> [    2.910025] RAX: 0000000000000000 RBX: fffffe000002e118 RCX:
>> 0000000000000001
>> [    2.910025] RDX: 0000000000000000 RSI: fffffe000002e018 RDI:
>> ffffffffbe0715a0
>> [    2.910025] RBP: fffffe000002e008 R08: ffffffffbe0bb565 R09:
>> ffffffffbe07159b
>> [    2.910025] R10: fffffe000002e080 R11: 0000000000000000 R12:
>> ffffffffbe070fdd
>> [    2.910025] R13: 0000000000000000 R14: 0000000000000000 R15:
>> 0000000000000000
>> [    2.910025] FS:  0000000000000000(0000) GS:ffff891f7fd00000(0000)
>> knlGS:0000000000000000
>> [    2.910025] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [    2.910025] CR2: fffffe000002dff8 CR3: 0000000078da6000 CR4:
>> 00000000000006e0

CR2 and RSP are consistent with stack overflow, but that's not
terribly important, since...

>> [    2.910025] Call Trace:
>> [    2.910025]        <ENTRY_TRAMPOLINE>
>> [    2.910025]        ? vprintk_func+0x27/0x60
>> [    2.910025]        printk+0x52/0x6e
>> [    2.910025]        __die+0x6b/0xe0
>> [    2.910025]        die+0x2f/0x50
>> [    2.910025]        do_general_protection+0x149/0x160

This is the real problem here.

>> [    2.910025]        general_protection+0x2c/0x60
>> [    2.910025] RIP: 0010:swapgs_restore_regs_and_return_to_usermode+0x6f/0x80
>> [    2.910025] RSP: 0000:fffffe000002e1c8 EFLAGS: 00000006
>> [    2.910025] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
>> 0000000000000000
>> [    2.910025] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
>> 0000000078da7800
>> [    2.910025] RBP: 0000000000000000 R08: 0000000000000000 R09:
>> 0000000000000000
>> [    2.910025] R10: 0000000000000000 R11: 0000000000000000 R12:
>> 0000000000000000
>> [    2.910025] R13: 0000000000000000 R14: 0000000000000000 R15:
>> 0000000000000000
>> [    2.910025]        </ENTRY_TRAMPOLINE>

ffffffff81a00b3e:       65 48 0f b3 3c 25 8e    btr    %rdi,%gs:0x1e168e
ffffffff81a00b45:       16 1e 00
ffffffff81a00b48:       48 89 c7                mov    %rax,%rdi
ffffffff81a00b4b:       eb 08                   jmp    0xffffffff81a00b55
ffffffff81a00b4d:       48 89 c7                mov    %rax,%rdi
ffffffff81a00b50:       48 0f ba ef 3f          bts    $0x3f,%rdi
ffffffff81a00b55:       48 81 cf 00 18 00 00    or     $0x1800,%rdi
ffffffff81a00b5c:       0f 22 df                mov    %rdi,%cr3  <-- #GP here
ffffffff81a00b5f:       58                      pop    %rax
ffffffff81a00b60:       5f                      pop    %rdi

The CPU didn't like writing 0x0000000078da7800 to %cr3.

I'm guessing that CR4.PCIDE=0.

Now this is quite a strange value to write to CR3.  The 0x800 part
means that we're using the "user" variant of the address space that
would have ASID=0 and the 0x1000 bit being set corresponds to the user
pgdir, but this is nonsense, since the kernel never uses PCID 0 for
user mode.  We always start at 1.  The only exception is if
X86_FEATURE_PCID is off.  But, if X86_FEATURE_PCID is off, then we
shouldn't be setting any PCID bits.

In fact, it looks like this code is totally bogus and has never been
correct at all.  Even in:

commit 4b1d5ae3b103eda43f9d0f85c355bb6995b03a30
Author: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Date:   Mon Dec 4 15:07:59 2017 +0100

    x86/mm: Use/Fix PCID to optimize user/kernel switches

We have:

.macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
        ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
        mov     %cr3, \scratch_reg

        ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID

...

.Lwrcr3_\@:
        /* Flip the PGD and ASID to the user version */
        orq     $(PTI_SWITCH_MASK), \scratch_reg
        mov     \scratch_reg, %cr3
.Lend_\@:

That's bogus.  PTI_SWITCH_MASK is 0x1800, which has PCID = 0x800.

This should probably use an alternative to select between 0x1000 and
0x800 depending on X86_FEATURE_PCID or just use an entirely different
label for the !PCID case.

FWIW, this bit in SAVE_AND_SWITCH_TO_KERNEL_CR3

        testq   $(PTI_SWITCH_MASK), \scratch_reg
        jz      .Ldone_\@

is a bit silly, too.  It's *correct* (I think), but shouldn't that
just be bt $(PTI_SWITCH_PGTABLES_BIT), \scratch_reg, with the obvious
caveat that the headers don't actually define PTI_SWITCH_PGTABLES_BIT?

Was this code *ever* tested with nopcid?