On Tue, Apr 30, 2019 at 9:06 AM Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > > On Tue, Apr 30, 2019 at 6:56 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote: > > > > Realistically, I don't think you can hit the problem in practice. The > only way to hit that incredibly small race of "one instruction, *both* > NMI and interrupts" is to have a lot of interrupts going all at the > same time, but that will also then solve the latency problem, so the > very act of triggering it will also fix it. > > I don't see any case where it's really bad. The "sti sysexit" race is > similar, just about latency of user space signal reporting (and > perhaps any pending TIF_WORK_xyz flags). In the worst case, it actually kills the machine. Last time I tracked a bug like this down, I think the issue was that we got preempted after the last TIF_ check, entered a VM, exited, context switched back, and switched to user mode without noticing that there was a ending KVM user return notifier. This left us with bogus CPU state and the machine exploded. Linus, can I ask you to reconsider your opposition to Josh's other approach of just shifting the stack on int3 entry? I agree that it's ugly, but the ugliness is easily manageable and fairly self-contained. We add a little bit of complication to the entry asm (but it's not like it's unprecedented -- the entry asm does all kinds of stack rearrangement due to IST and PTI crap already), and we add an int3_emulate_call(struct pt_regs *regs, unsigned long target) helper that has appropriate assertions that the stack is okay and emulates the call. And that's it. In contrast, your approach involves multiple asm trampolines, hash tables, batching complications, and sti shadows. As an additional argument, with the stack-shifting approach, it runs on *every int3 from kernel mode*. This means that we can do something like this: static bool int3_emulate_call_okay(struct pt_regs *regs) { unsigned long available_stack = regs->sp - (unsigned long); return available_stack >= sizeof(long); } void do_int3(...) { { WARN_ON_ONCE(!user_mode(regs) && !int3_emulate_call_okay(regs)); ...; } static void int3_emulate_call(struct pt_regs *regs, unsigned long target) { BUG_ON(user_mode(regs) || !int3_emulate_call_okey(regs)); regs->sp -= sizeof(unsigned long); *(unsigned long *)regs->sp = target; /* CET SHSTK fixup goes here */ } Obviously the CET SHSTK fixup might be rather nasty, but I suspect it's a solvable problem. A major benefit of this is that the entry asm nastiness will get exercised all the time, and, if we screw it up, the warning will fire. This is the basic principle behind why the entry stuff *works* these days. I've put a lot of effort into making sure that running kernels with CONFIG_DEBUG_ENTRY and running the selftests actually exercises the nasty cases. --Andy