On Fri, May 19, 2017 at 10:23:53PM -0700, Andy Lutomirski wrote: > On Fri, May 19, 2017 at 2:35 PM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote: > > On Fri, May 19, 2017 at 04:29:13PM -0500, Josh Poimboeuf wrote: > >> > How are you handling control flow? > >> > >> Control flow of what? > >> > >> > > Here's the struct in its current state: > >> > > > >> > > #define UNDWARF_REG_UNDEFINED 0 > >> > > #define UNDWARF_REG_CFA 1 > >> > > #define UNDWARF_REG_SP 2 > >> > > #define UNDWARF_REG_FP 3 > >> > > #define UNDWARF_REG_SP_INDIRECT 4 > >> > > #define UNDWARF_REG_FP_INDIRECT 5 > >> > > #define UNDWARF_REG_R10 6 > >> > > #define UNDWARF_REG_DI 7 > >> > > #define UNDWARF_REG_DX 8 > >> > > > >> > > >> > Why only those registers? Also, if you have the option I would really > >> > suggest using the actual x86 register numbers (ax, ex, dx, bx, sp, bp, > >> > si, di, r8-r15 in that order.) > >> > >> Those are the only registers which are ever needed as the base for > >> finding the previous stack frame. 99% of the time it's sp or bp, the > >> other registers are needed for aligned stacks and entry code. > >> > >> Using the actual register numbers isn't an option because I don't need > >> them all and they need to fit in a small number of bits. > >> > >> This construct might be useful for other arches, which is why I called > >> it "FP" instead of "BP". But then I ruined that with the last 3 :-) > > > > BTW, here's the link to the unwinder code if you're interested: > > > > https://github.com/jpoimboe/linux/blob/undwarf/arch/x86/kernel/unwind_undwarf.c > > At the risk of potentially overcomplicating matters, here's a > suggestion. As far as I know, all (or most all?) unwinders > effectively do the following in a loop: > > 1. Look up the IP to figure out how to unwind from that IP. > 2. Use the results of that lookup to compute the previous frame state. > > The results of step 1 could perhaps be expressed like this: > > struct reg_formula { > unsigned int source_reg :4; > long offset; > bool dereference; /* true: *(reg + offset); false: (reg + offset) */ > /* For DWARF, I think this can be considerably more complicated, but > I doubt it's useful. */ > }; > > struct unwind_step { > u16 available_regs; /* mask of the caller frame regs that we are > able to recover */ > struct reg_formula[16]; > }; Ok, so I assume you mean we would need to have an in-kernel DWARF reader which reads .eh_frame and converts it to the above, which is called for every step of the unwind phase. > The CFA computation is just reg_formula[UNWIND_REG_SP] (or that plus > or minus sizeof(unsigned long) or whatever -- I can never remember > exactly what CFA refers to.) For a frame pointer-based unwinder, the > entire unwind_step is a foregone conclusion independent of IP: SP = BP > + 8 (or whatever), BP = *(BP + whatever), all other regs unknown. > > Could it make sense to actually structure the code this way? I can > see a few advantages. It would make the actual meat of the unwind > loop totally independent of the unwinding algorithm in use, Yes, this part of it is an interesting idea, separating the debuginfo reading step from the unwinding step. And for people who don't want to carry around megs of DWARF data, get_dwarf_step() could just be a fake lookup which always returns the frame pointer version. > it would > make the meat be dead simple (and thus easy to verify for > non-crashiness), and I think it just might open the door for a real > in-kernel DWARF unwinder that Linus would be okay with. Specifically, > write a function like: > > bool get_dwarf_step(struct unwind_step *step, unsigned long ip); If we keep the frame pointer encoding thing for non-DWARF kernels then we may also need to pass in bp as well. Or maybe we could force frame pointer users to at least have DWARF data for the entry code. Then even frame pointer kernels could detect entry regs and we could get rid of that nasty frame pointer encoding thing. > Put this function in its own file and make it buildable as kernel code > or as user code. Write a test case that runs it on every single > address on the kernel image (in user mode!) with address-sanitizer > enabled (or in Valgrind or both) and make sure that (a) it doesn't > blow up and (b) that the results are credible (e.g. by comparing to > objtool output). Heck, you could even fuzz-test it where the fuzzer > is allowed to corrupt the actual DWARF data. You could do the same > thing with whatever crazy super-compacted undwarf scheme someone comes > up with down the road, too. I think your proposal can be separated into two ideas, which can each be considered on their own merit: 1) Put .eh_frame in the kernel, along with an in-kernel DWARF unwinder. Manage the complexity of the unwinder by validating the output of the complex part of the algorithm in user space. That's a lot of hoops to jump through. The only real advantage I can see is that it would allow us to use the toolchain's DWARF data. But that data is going to have issues anyway because of inline asm, special sections (e.g., exception tables), generated code (e.g., bpf), hand-coded asm with missing/incorrect annotations, etc. Now, we could use objtool to find such issues and warn about them. Then those issues could be corrected, either by hand or programmatically. But then, if we're going that far, why not just have objtool reformat the data into something much simpler? It already has the knowledge to do so. Then we don't have to jump through all those hoops to justify jumping through more hoops in the kernel (i.e., having a complex DWARF state machine). With a simple debuginfo format, the kernel unwinder is simple enough that we don't need to validate its functionality in a simulator. 2) Make a unified unwinder which uses get_dwarf_step() to abstract out the differences between frame pointers and DWARF (or undwarf or whatever else). This could an interesting. Though I'm not sure how it would integrate with our "guess" unwinder for kernels which don't have frame pointers or DWARF/undwarf. I guess we could still keep that one separate. > I personally like the idea of using real DWARF annotations in the > entry code because it makes gdb work better (not kgdb -- real gdb > attached to KVM). > > I bet that we could get entry asm annotations into > good shape if we really wanted to. OTOH, getting DWARF to work well > for inline asm is really nasty IIRC. > > (H.J., could we get a binutils feature that allows is to do: > > pushq %whatever > .cfi_adjust_sp -8 > ... > popq %whatever > .cfi_adjust_sp 8 > > that will emit the right DWARF instructions regardless of whether > there's a frame pointer or not? .cfi_adjust_cfa_offset is not > particularly helpful here because it's totally wrong if the CFA is > currently being computed based on BP.) I agree that entry_64.o should have DWARF data, regardless of what the in-kernel representation of the data looks like. And the same for all the other asm .o files. But how we achieve that is debatable. In the past, having all the manual .cfi annotations everywhere for every push/pop was an unmaintainable disaster. If we could have binutils do automatic adjustments for pushes and pops and sp adds/subtracts without the .cfi annotations, that would be great. Otherwise, objtool can do it. And it can write both undwarf and DWARF, if needed. > Also, you read the stack like this: > > *val = *(unsigned long *)addr; > > how about probe_kernel_read() instead? It depends on how paranoid you want to be. The undwarf code has the level of paranoia the frame pointer code has always had. In other words, it ensures each dereference is within the bounds of the current stack. It relies on task->stack and the percpu exception/irq stack pointers, as well as the previous stack pointers at the edge of each stack. I don't think we've ever had a problem with that level of paranoia, so I think staying with the status quo might be fine. If we did use probe_kernel_read() we'd have to use some other form of it because it does some kasan and hardened usercopy checks which wouldn't always be appropriate. -- Josh -- To unsubscribe from this list: send the line "unsubscribe live-patching" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html