On Thu, Feb 18, 2021 at 11:21 AM Joerg Roedel <jroedel@xxxxxxx> wrote: > > On Thu, Feb 18, 2021 at 09:49:06AM -0800, Andy Lutomirski wrote: > > I don't understand what this means. The whole entry mechanism on x86 > > is structured so that we call a C function *and return from that C > > function without longjmp-like magic* with the sole exception of > > unwind_stack_do_exit(). This means that you can match up enters and > > exits, and that unwind_stack_do_exit() needs to unwind correctly. In > > the former case, it's normal C and we can use normal local variables. > > In the latter case, we know exactly what state we're trying to restore > > and we can restore it directly without any linked lists or similar. > > Okay, the unwinder will likely get confused by this logic. > > > What do you have in mind that requires a linked list? > > Cases when there are multiple IST vectors besides NMI that can hit while > the #VC handler is still on its own IST stack. #MCE comes to mind, but > that is broken anyway. At some point #VC itself will be one of them, but > when that happens the code will kill the machine. > This leaves #HV in the list, and I am not sure how that is going to be > handled yet. I think the goal is that the #HV handler is not allowed to > cause any #VC exception. In that case the linked-list logic will not be > needed. Can you give me an example, even artificial, in which the linked-list logic is useful? > > > > I don't see how this would break, can you elaborate? > > > > > > What I think happens is: > > > > > > SYSCALL gap (RSP is from userspace and untrusted) > > > > > > -> #VC - Handler on #VC IST stack detects that it interrupted > > > the SYSCALL gap and switches to the task stack. > > > > > > > Can you point me to exactly what code you're referring to? I spent a > > while digging through the code and macro tangle and I can't find this. > > See the entry code in arch/x86/entry/entry_64.S, macro idtentry_vc. It > creates the assembly code for the handler. At some point it calls > vc_switch_off_ist(), which is a C function in arch/x86/kernel/traps.c. > This function tries to find a new stack for the #VC handler. > > The first thing it does is checking whether the exception came from the > SYSCALL gap and just uses the task stack in that case. > > Then it will check for other kernel stacks which are safe to switch > to. If that fails it uses the fall-back stack (VC2), which will direct > the handler to a separate function which, for now, just calls panic(). > Not safe are the entry or unknown stacks. Can you explain your reasoning in considering the entry stack unsafe? It's 4k bytes these days. You forgot about entry_SYSCALL_compat. Your 8-byte alignment is confusing to me. In valid kernel code, SP should be 8-byte-aligned already, and, if you're trying to match architectural behavior, the CPU aligns to 16 bytes. We're not robust against #VC, NMI in the #VC prologue before the magic stack switch, and a new #VC in the NMI prologue. Nor do we appear to have any detection of the case where #VC nests directly inside its own prologue. Or did I miss something else here? If we get NMI and get #VC in the NMI *asm*, the #VC magic stack switch looks like it will merrily run itself in the NMI special-stack-layout section, and that sounds really quite bad. > > The function then copies pt_regs and returns the new stack pointer to > assembly code, which then writes it to %RSP. > > > Unless AMD is more magic than I realize, the MOV SS bug^Wfeature means > > that #DB is *not* always called in safe places. > > You are right, forgot about this. The MOV SS bug can very well > trigger a #VC(#DB) exception from the syscall gap. > > > > And with SNP we need to be able to at least detect a malicious HV so we > > > can reliably kill the guest. Otherwise the HV could potentially take > > > control over the guest's execution flow and make it reveal its secrets. > > > > True. But is the rest of the machinery to be secure against EFLAGS.IF > > violations and such in place yet? > > Not sure what you mean by EFLAGS.IF violations, probably enabling IRQs > while in the #VC handler? The #VC handler _must_ _not_ enable IRQs > anywhere in its call-path. If that ever happens it is a bug. > I mean that, IIRC, a malicious hypervisor can inject inappropriate vectors at inappropriate times if the #HV mechanism isn't enabled. For example, it could inject a page fault or an interrupt in a context in which we have the wrong GSBASE loaded. But the #DB issue makes this moot. We have to use IST unless we turn off SCE. But I admit I'm leaning toward turning off SCE until we have a solution that seems convincingly robust.