On Mon, May 2, 2016 at 10:31 AM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
> On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote:
>> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
>> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
>> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@xxxxxxxxxx> wrote:
>> >> >
>> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
>> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
>> >> > > >> I suppose we could try to rejigger the code so that rbp points to
>> >> > > >> pt_regs or similar.
>> >> > > >
>> >> > > > I think we should avoid doing something like that because it would break
>> >> > > > gdb and all the other unwinders who don't know about it.
>> >> > >
>> >> > > How so?
>> >> > >
>> >> > > Currently, rbp in the entry code is meaningless. I'm suggesting that,
>> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
>> >> > > the pt_regs. Currently it points to something stale (which the
>> >> > > dump_stack code might be relying on. Hmm.) But it's probably also
>> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
>> >> > > is the next thing on the stack, so just doing the section thing would
>> >> > > work.
>> >> >
>> >> > Yes, rbp is meaningless on the entry from user space. But if an
>> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
>> >> > nested entry, rbp keeps its old value, right? So the unwinder can walk
>> >> > past the nested entry frame and keep going until it gets to the original
>> >> > entry.
>> >>
>> >> Yes.
>> >>
>> >> It would be nice if we could do better, though, and actually notice
>> >> the pt_regs and identify the entry. For example, I'd love to see
>> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a
>> >> crash.
>> >>
>> >> Also, I think that just following rbp links will lose the
>> >> actual function that took the page fault (or whatever function
>> >> pt_regs->ip actually points to).
>> >
>> > Hm. I think we could fix all that in a more standard way. Whenever a
>> > new pt_regs frame gets saved on entry, we could also create a new stack
>> > frame which points to a fake kernel_entry() function. That would tell
>> > the unwinder there's a pt_regs frame without otherwise breaking frame
>> > pointers across the frame.
>> >
>> > Then I guess we wouldn't need my other solution of putting the idt
>> > entries in a special section.
>> >
>> > How does that sound?
>>
>> Let me try to understand.
>>
>> The normal call sequence is call; push %rbp; mov %rsp, %rbp. So rbp
>> points to (prev rbp, prev rip) on the stack, and you can follow the
>> chain back. Right now, on a user access page fault or similar, we
>> have rbp (probably) pointing to the interrupted frame, and the
>> interrupted rip isn't saved anywhere that a naive unwinder can find
>> it. (It's in pt_regs, but the rbp chain skips right over that.)
>>
>> We could change the entry code so that an interrupt / idtentry does:
>>
>>     push pt_regs
>>     push kernel_entry
>>     push %rbp
>>     mov %rsp, %rbp
>>     call handler
>>     pop %rbp
>>     addq $8, %rsp
>>
>> or similar. That would make it appear that the actual C handler was
>> caused by a dummy function "kernel_entry". Now the unwinder would get
>> to kernel_entry, but it *still* wouldn't find its way to the calling
>> frame, which only solves part of the problem. We could at least teach
>> the unwinder how kernel_entry works and let it decode pt_regs to
>> continue unwinding. This would be nice, and I think it could work.
>
> Yeah, that's about what I had in mind.

FWIW, I just tried this:

static bool is_entry_text(unsigned long addr)
{
	return addr >= (unsigned long)__entry_text_start &&
		addr < (unsigned long)__entry_text_end;
}

it works.
So the entry code is already annotated reasonably well :) I just
hacked it up here:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack&id=085eacfe0edfc18768e48340084415dba9a6bd21

and it seems to work, at least for page faults. A better
implementation would print out the entire contents of pt_regs so that
people reading the stack trace will know the registers at the time of
the exception, which might be helpful.

>
>> I think I like this, except that, if it used a separate section, it
>> could potentially be faster, as, for each actual entry type, the
>> offset from the C handler frame to pt_regs is a foregone conclusion.
>
> Hm, this I don't really follow. It's true that the unwinder can easily
> find RIP from pt_regs, which will always be a known offset from the
> kernel_entry pointer on the stack. But why would having the entry code
> in a separate section make that faster?

It doesn't make the unwinder faster -- it makes the entry code faster.

>
>> But this is pretty simple and performance is already abysmal in most
>> handlers.
>>
>> There's an added benefit to using a separate section, though: we could
>> also annotate the calls with what type of entry they were so the
>> unwinder could print it out nicely.
>
> Yeah, that could be a nice feature... but doesn't printing the name of
> the C handler pretty much already give that information?
>
> In any case, once we have a working DWARF unwinder, I think it will show
> the name of the idt entry anyway.

True. And it'll automatically follow pt_regs.

>
>> >> Have you looked at my vdso unwinding test at all? If we could do
>> >> something similar for the kernel, IMO it would make testing much more
>> >> pleasant.
>> >
>> > I found it, but I'm not sure what it would mean to do something similar
>> > for the kernel. Do you mean doing something like an NMI sampling-based
>> > approach where we periodically do a random stack sanity check?
>>
>> I was imagining something a little more strict: single-step
>> interesting parts of the kernel and make sure that each step unwinds
>> correctly. That could detect missing frames and similar.
>
> Interesting idea, though I wonder how hard it would be to reliably
> distinguish a missing frame from the case where gcc decides to inline a
> function.
>
> Another idea to detect missing frames: for each return address on the
> stack, ensure there's a corresponding "call <func>" instruction
> immediately preceding the return location, where <func> matches what's
> on the stack.

Hmm, interesting.

I hope your plans include rewriting the current stack unwinder
completely. The thing in print_context_stack is (a)
hard-to-understand and hard-to-modify crap and (b) is called in a loop
from another file using totally ridiculous conventions.

--Andy
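
For reference, the call-site check quoted above (each return address
should sit right after a "call <func>" whose target matches) could be
sketched in user-space C roughly as follows. This is only an
illustration under stated assumptions: ret_addr_follows_call() is a
hypothetical name, not an existing kernel function; it recognizes only
the direct near call encoding (opcode 0xE8 followed by a 32-bit
displacement relative to the next instruction); and it works on offsets
into a byte buffer rather than on real kernel text.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not kernel code: return true if the return address
 * at offset ret_off in the code buffer is immediately preceded by a direct
 * near call (0xE8 + rel32) whose target is func_off.
 *
 * Assumptions: little-endian host (as on x86), and only the 5-byte direct
 * call form is handled. Indirect calls such as "call *%rax" are reported
 * as "no match" by this sketch.
 */
static bool ret_addr_follows_call(const uint8_t *text, size_t text_len,
				  size_t ret_off, size_t func_off)
{
	int32_t rel;

	/* A 5-byte call must fit entirely before the return address. */
	if (ret_off < 5 || ret_off > text_len)
		return false;

	/* Opcode byte of "call rel32". */
	if (text[ret_off - 5] != 0xE8)
		return false;

	/* Read the signed 32-bit displacement that follows the opcode. */
	memcpy(&rel, &text[ret_off - 4], sizeof(rel));

	/* The displacement is relative to the next instruction (ret_off). */
	return (int64_t)ret_off + rel == (int64_t)func_off;
}
```

An indirect call would fail this check even though the frame is valid,
and a stray 0xE8 byte inside a longer instruction could match by
accident, so a real checker would likely need a proper instruction
decoder.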