On Mon, Apr 29, 2019 at 11:29 AM Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > > On Mon, Apr 29, 2019 at 11:06 AM Linus Torvalds > <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > > > > > > It does *not* emulate the "call" in the BP handler itself, instead if > > replace the %ip (the same way all the other BP handlers replace the > > %ip) with a code sequence that just does > > > > push %gs:bp_call_return > > jmp *%gs:bp_call_target > > > > after having filled in those per-cpu things. > > Note that if you read the patch, you'll see that my explanation > glossed over the "what if an interrupt happens" part. Which is handled > by having two handlers, one for "interrupts were already disabled" and > one for "interrupts were enabled, so I disabled them before entering > the handler". This is quite a cute solution. > > The second handler does the same push/jmp sequence, but has a "sti" > before the jmp. Because of the one-instruction sti shadow, interrupts > won't actually be enabled until after the jmp instruction has > completed, and thus the "push/jmp" is atomic wrt regular interrupts. > > It's not safe wrt NMI, of course, but since NMI won't be rescheduling, > and since any SMP IPI won't be punching through that sequence anyway, > it's still atomic wrt _another_ text_poke() attempt coming in and > re-using the bp_call_return/tyarget slots. I'm less than 100% convinced about this argument. Sure, an NMI right there won't cause a problem. But an NMI followed by an interrupt will kill us if preemption is on. I can think of three solutions: 1. Assume that all CPUs (and all relevant hypervisors!) either mask NMIs in the STI shadow or return from NMIs with interrupts masked for one instruction. Ditto for MCE. This seems too optimistic. 2. Put a fixup in the NMI handler and MCE handler: if the return address is one of these magic jumps, clear IF and back up IP by one byte. This should *work*, but it's a bit ugly. 3. Use a different magic sequence: push %gs:bp_call_return int3 and have the int3 handler adjust regs->ip and regs->flags as appropriate. I think I like #3 the best, even though it's twice as slow. FWIW, kernel shadow stack patches will show up eventually, and none of these approaches are compatible as is. Perhaps the actual sequence should be this, instead: bp_int3_fixup_irqsoff: call 1f 1: int3 bp_int3_fixup_irqson: call 1f 1: int3 and the int3 handler will update the conventional return address *and* the shadow return address. Linus, what do you think about this variant? Finally, with my maintainer hat on: if anyone actually wants to merge this thing, I want to see a test case, exercised regularly (every boot in configured, perhaps) that actually *runs* this code. Hitting it in practice will be rare, and I want bugs to be caught.