On Mon, May 23, 2016 at 02:34:56PM -0700, Andy Lutomirski wrote: > On Thu, May 19, 2016 at 4:15 PM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote: > > On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote: > >> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote: > >> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote: > >> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@xxxxxxxxxx> wrote: > >> >> > > >> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote: > >> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote: > >> >> > > >> I suppose we could try to rejigger the code so that rbp points to > >> >> > > >> pt_regs or similar. > >> >> > > > > >> >> > > > I think we should avoid doing something like that because it would break > >> >> > > > gdb and all the other unwinders who don't know about it. > >> >> > > > >> >> > > How so? > >> >> > > > >> >> > > Currently, rbp in the entry code is meaningless. I'm suggesting that, > >> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to > >> >> > > the pt_regs. Currently it points to something stale (which the > >> >> > > dump_stack code might be relying on. Hmm.) But it's probably also > >> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs > >> >> > > is the next thing on the stack, so just doing the section thing would > >> >> > > work. > >> >> > > >> >> > Yes, rbp is meaningless on the entry from user space. But if an > >> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have > >> >> > nested entry, rbp keeps its old value, right? So the unwinder can walk > >> >> > past the nested entry frame and keep going until it gets to the original > >> >> > entry. > >> >> > >> >> Yes. > >> >> > >> >> It would be nice if we could do better, though, and actually notice > >> >> the pt_regs and identify the entry. For example, I'd love to see > >> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a > >> >> crash. > >> >> > >> >> Also, I think that just following rbp links will lose the > >> >> actual function that took the page fault (or whatever function > >> >> pt_regs->ip actually points to). > >> > > >> > Hm. I think we could fix all that in a more standard way. Whenever a > >> > new pt_regs frame gets saved on entry, we could also create a new stack > >> > frame which points to a fake kernel_entry() function. That would tell > >> > the unwinder there's a pt_regs frame without otherwise breaking frame > >> > pointers across the frame. > >> > > >> > Then I guess we wouldn't need my other solution of putting the idt > >> > entries in a special section. > >> > > >> > How does that sound? > >> > >> Let me try to understand. > >> > >> The normal call sequence is call; push %rbp; mov %rsp, %rbp. So rbp > >> points to (prev rbp, prev rip) on the stack, and you can follow the > >> chain back. Right now, on a user access page fault or similar, we > >> have rbp (probably) pointing to the interrupted frame, and the > >> interrupted rip isn't saved anywhere that a naive unwinder can find > >> it. (It's in pt_regs, but the rbp chain skips right over that.) > >> > >> We could change the entry code so that an interrupt / idtentry does: > >> > >> push pt_regs > >> push kernel_entry > >> push %rbp > >> mov %rsp, %rbp > >> call handler > >> pop %rbp > >> addq $8, %rsp > >> > >> or similar. That would make it appear that the actual C handler was > >> caused by a dummy function "kernel_entry". Now the unwinder would get > >> to kernel_entry, but it *still* wouldn't find its way to the calling > >> frame, which only solves part of the problem. We could at least teach > >> the unwinder how kernel_entry works and let it decode pt_regs to > >> continue unwinding. This would be nice, and I think it could work. > >> > >> I think I like this, except that, if it used a separate section, it > >> could potentially be faster, as, for each actual entry type, the > >> offset from the C handler frame to pt_regs is a foregone conclusion. > >> But this is pretty simple and performance is already abysmal in most > >> handlers. > >> > >> There's an added benefit to using a separate section, though: we could > >> also annotate the calls with what type of entry they were so the > >> unwinder could print it out nicely. > >> > >> I could be convinced either way. > > > > Ok, I took a stab at this. See the patch below. > > > > In addition to annotating interrupt/exception pt_regs frames, I also > > annotated all the syscall pt_regs, for consistency. > > > > As you mentioned, it will affect performance a bit, but I think it will > > be insignificant. > > > > I think I like this approach better than putting the > > interrupt/idtentry's in a special section, because this is much more > > precise. Especially now that I'm annotating pt_regs syscalls. > > > > Also I think with a few minor changes we could implement your idea of > > annotating the calls with what type of entry they are. But I don't > > think that's really needed, because the name of the interrupt/idtentry > > is already on the stack trace. > > > > Before: > > > > [<ffffffff8143c243>] dump_stack+0x85/0xc2 > > [<ffffffff81073596>] __do_page_fault+0x576/0x5a0 > > [<ffffffff8107369c>] trace_do_page_fault+0x5c/0x2e0 > > [<ffffffff8106d83c>] do_async_page_fault+0x2c/0xa0 > > [<ffffffff81887058>] async_page_fault+0x28/0x30 > > [<ffffffff81451560>] ? copy_page_to_iter+0x70/0x440 > > [<ffffffff811ebeac>] ? pagecache_get_page+0x2c/0x290 > > [<ffffffff811edaeb>] generic_file_read_iter+0x26b/0x770 > > [<ffffffff81285e32>] __vfs_read+0xe2/0x140 > > [<ffffffff81286378>] vfs_read+0x98/0x140 > > [<ffffffff812878c8>] SyS_read+0x58/0xc0 > > [<ffffffff81884dbc>] entry_SYSCALL_64_fastpath+0x1f/0xbd > > > > After: > > > > [<ffffffff8143c243>] dump_stack+0x85/0xc2 > > [<ffffffff81073596>] __do_page_fault+0x576/0x5a0 > > [<ffffffff8107369c>] trace_do_page_fault+0x5c/0x2e0 > > [<ffffffff8106d83c>] do_async_page_fault+0x2c/0xa0 > > [<ffffffff81887422>] async_page_fault+0x32/0x40 > > [<ffffffff81887861>] pt_regs+0x1/0x10 > > [<ffffffff81451560>] ? copy_page_to_iter+0x70/0x440 > > [<ffffffff811ebeac>] ? pagecache_get_page+0x2c/0x290 > > [<ffffffff811edaeb>] generic_file_read_iter+0x26b/0x770 > > [<ffffffff81285e32>] __vfs_read+0xe2/0x140 > > [<ffffffff81286378>] vfs_read+0x98/0x140 > > [<ffffffff812878c8>] SyS_read+0x58/0xc0 > > [<ffffffff81884dc6>] entry_SYSCALL_64_fastpath+0x29/0xdb > > [<ffffffff81887861>] pt_regs+0x1/0x10 > > > > Note this example is with today's unwinder. It could be made smarter to > > get the RIP from the pt_regs so the '?' could be removed from > > copy_page_to_iter(). > > > > Thoughts? > > Maybe I'm coming around to liking this idea. Ok, good :-) > In an ideal world (DWARF support, high-quality unwinder, nice pretty > printer, etc), unwinding across a kernel exception would look like: > > - some_func > - some_other_func > - do_page_fault > - page_fault > > After page_fault, the next unwind step takes us to the faulting RIP > (faulting_func) and reports that all GPRs are known. It should > probably learn this fact from DWARF if DWARF is available, instead of > directly decoding pt_regs, due to a few funny cases in which pt_regs > may be incomplete. A nice pretty printer could now print all the > regs. > > - faulting_func > - etc. > > For this to work, we don't actually need the unwinder to explicitly > know where pt_regs is. That's true (but only for DWARF). > Food for thought, though: if user code does SYSENTER with TF set, > you'll end up with partial pt_regs. There's nothing the kernel can do > about it. DWARF will handle it without any fanfare, but I don't know > if it's going to cause trouble for you pre-DWARF. In this case it should see the stack pointer is past the pt_regs offset, so it would just report it as an empty stack. > I'm also not sure it makes sense to apply this before the unwinder > that can consume it is ready. Maybe, if it would be consistent with > your plans, it would make sense to rewrite the unwinder first, then > apply this and teach live patching to use the new unwinder, and *then* > add DWARF support? For the purposes of livepatch, the reliable unwinder needs to detect whether an interrupt/exception pt_regs frame exists on a sleeping task (or current). This patch would allow us to do that. So my preferred order of doing things would be: 1) Brian Gerst's switch_to() cleanup and any related unwinder fixes 2) this patch for annotating pt_regs stack frames 3) reliable unwinder, similar to what I already posted, except it relies on this patch instead of PF_PREEMPT_IRQ, and knows how to deal with the new inactive task frame format of #1 4) livepatch consistency model which uses the reliable unwinder 5) rewrite unwinder, and port all users to the new interface 6) add DWARF unwinder 1-4 are pretty much already written, whereas 5 and 6 will take considerably more work. > > + /* > > + * Create a stack frame for the saved pt_regs. This allows frame > > + * pointer based unwinders to find pt_regs on the stack. > > + */ > > + .macro CREATE_PT_REGS_FRAME regs=%rsp > > +#ifdef CONFIG_FRAME_POINTER > > + pushq \regs > > + pushq $pt_regs+1 > > Why the +1? Some unwinders like gdb are smart enough to report the function which contains the instruction *before* the return address. Without the +1, they would show the wrong function. > > + pushq %rbp > > + movq %rsp, %rbp > > +#endif > > + .endm > > I keep wanting this to be only two pushes and to fudge rbp to make it > work, but I don't see a good way. But let's call it > CREATE_NESTED_ENTRY_FRAME or something, and let's rename pt_regs to > nested_frame or similar. Or, if we aren't going to annotate syscall pt_regs, we could give it a more specific name. CREATE_INTERRUPT_FRAME and interrupt_frame()? > > + > > + .macro CALL_HANDLER handler regs=%rsp > > + CREATE_PT_REGS_FRAME \regs > > + call \handler > > + REMOVE_PT_REGS_FRAME > > + .endm > > I think I'd rather open-code this everywhere. It'll make it clearer > what's going on. Ok. > > @@ -199,6 +199,7 @@ entry_SYSCALL_64_fastpath: > > ja 1f /* return -ENOSYS (already in pt_regs->ax) */ > > movq %r10, %rcx > > > > + CREATE_PT_REGS_FRAME > > /* > > * This call instruction is handled specially in stub_ptregs_64. > > * It might end up jumping to the slow path. If it jumps, RAX > > @@ -207,6 +208,8 @@ entry_SYSCALL_64_fastpath: > > call *sys_call_table(, %rax, 8) > > .Lentry_SYSCALL_64_after_fastpath_call: > > > > + REMOVE_PT_REGS_FRAME > > + > > As discussed, let's get rid of this bit. Yeah, it's fine with me to get rid of all the syscall stuff. > > > movq %rax, RAX(%rsp) > > 1: > > > > @@ -238,14 +241,14 @@ entry_SYSCALL_64_fastpath: > > ENABLE_INTERRUPTS(CLBR_NONE) > > SAVE_EXTRA_REGS > > movq %rsp, %rdi > > - call syscall_return_slowpath /* returns with IRQs disabled */ > > + CALL_HANDLER syscall_return_slowpath /* returns with IRQs disabled */ > > and this. This will be gone... > > > jmp return_from_SYSCALL_64 > > > > entry_SYSCALL64_slow_path: > > /* IRQs are off. */ > > SAVE_EXTRA_REGS > > movq %rsp, %rdi > > - call do_syscall_64 /* returns with IRQs disabled */ > > + CALL_HANDLER do_syscall_64 /* returns with IRQs disabled */ > > > > return_from_SYSCALL_64: > > RESTORE_EXTRA_REGS > > @@ -344,6 +347,7 @@ ENTRY(stub_ptregs_64) > > DISABLE_INTERRUPTS(CLBR_NONE) > > TRACE_IRQS_OFF > > popq %rax > > + REMOVE_PT_REGS_FRAME > > This will be less mysterious if you open-code the macros. Also, I > think you have to, some return_from_SYSCALL_64 needs to be directly > after the actual call instruction. (But if you get rid of the hunks > above, I think this goes away too, so this may be moot.) and this... > > 1: > > @@ -372,7 +376,7 @@ END(ptregs_\func) > > ENTRY(ret_from_fork) > > LOCK ; btr $TIF_FORK, TI_flags(%r8) > > > > - call schedule_tail /* rdi: 'prev' task parameter */ > > + CALL_HANDLER schedule_tail /* rdi: 'prev' task parameter */ > > > > If you end up making the unwinder smart enough to notice that rsp is > just below pt_regs, then this can go away. It's harmless, though. and this... > > testb $3, CS(%rsp) /* from kernel_thread? */ > > jnz 1f > > @@ -385,8 +389,9 @@ ENTRY(ret_from_fork) > > * parameter to be passed in RBP. The called function is permitted > > * to call do_execve and thereby jump to user mode. > > */ > > + movq RBX(%rsp), %rbx > > movq RBP(%rsp), %rdi > > - call *RBX(%rsp) > > + CALL_HANDLER *%rbx > > Does using a register like this actually save any code size? > Admittedly, it's a bit cleaner. and this. (FWIW, I used a register because the assembler macro didn't seem to support passing "*RBX(%rsp)" as an argument.) > > + > > +/* fake function which allows stack unwinders to detect pt_regs frames */ > > +#ifdef CONFIG_FRAME_POINTER > > +ENTRY(pt_regs) > > + nop > > + nop > > +ENDPROC(pt_regs) > > +#endif /* CONFIG_FRAME_POINTER */ > > Why is this two bytes long? Is there some reason it has to be more > than one byte? Similar to above, this is related to the need to support various unwinders. Whether the unwinder displays the ret_addr or the instruction preceding it, either way the instruction needs to be inside the function for the function to be reported. -- Josh -- To unsubscribe from this list: send the line "unsubscribe live-patching" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html