Add a new 'undwarf' unwinder which is enabled by CONFIG_UNDWARF_UNWINDER. It plugs into the existing x86 unwinder framework. It relies on objtool to generate the needed .undwarf section. For more details on why undwarf is used instead of DWARF, see tools/objtool/Documentation/undwarf.txt. Thanks to Andy Lutomirski for the performance improvement ideas: splitting the undwarf table into two parallel arrays and creating a fast lookup table to search a subset of the undwarf table. Signed-off-by: Josh Poimboeuf <jpoimboe@xxxxxxxxxx> --- Documentation/x86/undwarf.txt | 146 ++++++++++ arch/um/include/asm/unwind.h | 8 + arch/x86/Kconfig | 1 + arch/x86/Kconfig.debug | 25 ++ arch/x86/include/asm/module.h | 9 + arch/x86/include/asm/unwind.h | 77 +++-- arch/x86/kernel/Makefile | 8 +- arch/x86/kernel/module.c | 12 +- arch/x86/kernel/setup.c | 3 + arch/x86/kernel/unwind_frame.c | 39 ++- arch/x86/kernel/unwind_guess.c | 5 + arch/x86/kernel/unwind_undwarf.c | 589 ++++++++++++++++++++++++++++++++++++++ arch/x86/kernel/vmlinux.lds.S | 2 + include/asm-generic/vmlinux.lds.h | 20 +- lib/Kconfig.debug | 3 + scripts/Makefile.build | 14 +- 16 files changed, 898 insertions(+), 63 deletions(-) create mode 100644 Documentation/x86/undwarf.txt create mode 100644 arch/um/include/asm/unwind.h create mode 100644 arch/x86/kernel/unwind_undwarf.c diff --git a/Documentation/x86/undwarf.txt b/Documentation/x86/undwarf.txt new file mode 100644 index 0000000..d76c6b4 --- /dev/null +++ b/Documentation/x86/undwarf.txt @@ -0,0 +1,146 @@ +Undwarf unwinder debuginfo generation +===================================== + +Overview +-------- + +The kernel CONFIG_UNDWARF_UNWINDER option enables objtool generation of +undwarf debuginfo, which is out-of-band data which is used by the +in-kernel undwarf unwinder. It's similar in concept to DWARF CFI +debuginfo which would be used by a DWARF unwinder. 
The difference is +that the format of the undwarf data is simpler than DWARF, which in turn +allows the unwinder to be simpler and faster. + +Objtool generates the undwarf data by first doing compile-time stack +metadata validation (CONFIG_STACK_VALIDATION). After analyzing all the +code paths of a .o file, it determines information about the stack state +at each instruction address in the file and outputs that information to +the .undwarf and .undwarf_ip sections. + +The undwarf sections are combined at link time and are sorted at boot +time. The unwinder uses the resulting data to correlate instruction +addresses with their stack states at run time. + + +Undwarf vs frame pointers +------------------------- + +With frame pointers enabled, GCC adds instrumentation code to every +function in the kernel. The kernel's .text size increases by about +3.2%, resulting in a broad kernel-wide slowdown. Measurements by Mel +Gorman [1] have shown a slowdown of 5-10% for some workloads. + +In contrast, the undwarf unwinder has no effect on text size or runtime +performance, because the debuginfo is out of band. So if you disable +frame pointers and enable undwarf, you get a nice performance +improvement across the board, and still have reliable stack traces. + +Another benefit of undwarf compared to frame pointers is that it can +reliably unwind across interrupts and exceptions. Frame pointer based +unwinds can skip the caller of the interrupted function if it was a leaf +function or if the interrupt hit before the frame pointer was saved. + +The main disadvantage of undwarf compared to frame pointers is that it +needs more memory to store the undwarf table: roughly 2-4MB depending on +the kernel config. + + +Undwarf vs DWARF +---------------- + +Undwarf debuginfo's advantage over DWARF itself is that it's much +simpler. It gets rid of the complex DWARF CFI state machine and also +gets rid of the tracking of unnecessary registers. 
This allows the +unwinder to be much simpler, meaning fewer bugs, which is especially +important for mission critical oops code. + +The simpler debuginfo format also enables the unwinder to be much faster +than DWARF, which is important for perf and lockdep. In a basic +performance test by Jiri Slaby [2], the undwarf unwinder was about 20x +faster than an out-of-tree DWARF unwinder. (Note: that measurement was +taken before some performance tweaks were implemented, so the speedup +may be even higher.) + +The undwarf format does have a few downsides compared to DWARF. The +undwarf table takes up ~2MB more memory than a DWARF .eh_frame table. + +Another potential downside is that, as GCC evolves, it's conceivable +that the undwarf data may end up being *too* simple to describe the +state of the stack for certain optimizations. But IMO this is unlikely +because GCC saves the frame pointer for any unusual stack adjustments it +does, so I suspect we'll really only ever need to keep track of the +stack pointer and the frame pointer between call frames. But even if we +do end up having to track all the registers DWARF tracks, at least we +will still be able to control the format, e.g. no complex state +machines. + + +Undwarf debuginfo generation +---------------------------- + +The undwarf data is generated by objtool. With the existing +compile-time stack metadata validation feature, objtool already follows +all code paths, and so it already has all the information it needs to be +able to generate undwarf data from scratch. So it's an easy step to go +from stack validation to undwarf generation. + +It should be possible to instead generate the undwarf data with a simple +tool which converts DWARF to undwarf. However, such a solution would be +incomplete due to the kernel's extensive use of asm, inline asm, and +special sections like exception tables. 
+ +That could be rectified by manually annotating those special code paths +using GNU assembler .cfi annotations in .S files, and homegrown +annotations for inline asm in .c files. But asm annotations were tried +in the past and were found to be unmaintainable. They were often +incorrect/incomplete and made the code harder to read and keep updated. +And based on looking at glibc code, annotating inline asm in .c files +might be even worse. + +Objtool still needs a few annotations, but only in code which does +unusual things to the stack like entry code. And even then, far fewer +annotations are needed than what DWARF would need, so they're much more +maintainable than DWARF CFI annotations. + +So the advantages of using objtool to generate undwarf are that it gives +more accurate debuginfo, with very few annotations. It also insulates +the kernel from toolchain bugs which can be very painful to deal with in +the kernel since we often have to work around issues in older versions of +the toolchain for years. + +The downside is that the unwinder now becomes dependent on objtool's +ability to reverse engineer GCC code paths. If GCC optimizations become +too complicated for objtool to follow, the undwarf generation might stop +working or become incomplete. (It's worth noting that livepatch already +has such a dependency on objtool's ability to follow GCC code paths.) + +If newer versions of GCC come up with some optimizations which break +objtool, we may need to revisit the current implementation. Some +possible solutions would be asking GCC to make the optimizations more +palatable, or having objtool use DWARF as an additional input, or +creating a GCC plugin to assist objtool with its analysis. But for now, +objtool follows GCC code quite well. 
+ + +Unwinder implementation details +------------------------------- + +Objtool generates the undwarf data by integrating with the compile-time +stack metadata validation feature, which is described in detail in +tools/objtool/Documentation/stack-validation.txt. After analyzing all +the code paths of a .o file, it creates an array of undwarf structs, and +a parallel array of instruction addresses associated with those structs, +and writes them to the .undwarf and .undwarf_ip sections respectively. + +The undwarf data is split into the two arrays for performance reasons, +to make the searchable part of the data (.undwarf_ip) more compact. The +arrays are sorted in parallel at boot time. + +Performance is further improved by the use of a fast lookup table which +is created at runtime. The fast lookup table associates a given address +with a range of undwarf table indices, so that only a small subset of +the undwarf table needs to be searched. + + +[1] https://lkml.kernel.org/r/20170602104048.jkkzssljsompjdwy@xxxxxxx +[2] https://lkml.kernel.org/r/d2ca5435-6386-29b8-db87-7f227c2b713a@xxxxxxx diff --git a/arch/um/include/asm/unwind.h b/arch/um/include/asm/unwind.h new file mode 100644 index 0000000..53f507c --- /dev/null +++ b/arch/um/include/asm/unwind.h @@ -0,0 +1,8 @@ +#ifndef _ASM_UML_UNWIND_H +#define _ASM_UML_UNWIND_H + +static inline void +unwind_module_init(struct module *mod, void *undwarf_ip, size_t undwarf_ip_size, + void *undwarf, size_t undwarf_size) {} + +#endif /* _ASM_UML_UNWIND_H */ diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 72028a1..adf3222 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -153,6 +153,7 @@ config X86 select HAVE_MEMBLOCK select HAVE_MEMBLOCK_NODE_MAP select HAVE_MIXED_BREAKPOINTS_REGS + select HAVE_MOD_ARCH_SPECIFIC select HAVE_NMI select HAVE_OPROFILE select HAVE_OPTPROBES diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug index fcb7604..995434c 100644 --- a/arch/x86/Kconfig.debug +++ 
b/arch/x86/Kconfig.debug @@ -357,4 +357,29 @@ config PUNIT_ATOM_DEBUG The current power state can be read from /sys/kernel/debug/punit_atom/dev_power_state +config UNDWARF_UNWINDER + bool "undwarf unwinder" + depends on X86_64 + select STACK_VALIDATION + ---help--- + This option enables the "undwarf" unwinder for unwinding kernel stack + traces. It uses a custom data format which is a simplified version + of the DWARF Call Frame Information standard. + + This unwinder is more accurate across interrupt entry frames than the + frame pointer unwinder. This also can enable a small performance + improvement across the entire kernel if CONFIG_FRAME_POINTER is + disabled. + + Enabling this option will increase the kernel's runtime memory usage + by roughly 2-4MB, depending on your kernel config. + +config FRAME_POINTER_UNWINDER + def_bool y + depends on !UNDWARF_UNWINDER && FRAME_POINTER + +config GUESS_UNWINDER + def_bool y + depends on !UNDWARF_UNWINDER && !FRAME_POINTER + endmenu diff --git a/arch/x86/include/asm/module.h b/arch/x86/include/asm/module.h index e3b7819..4dc6427 100644 --- a/arch/x86/include/asm/module.h +++ b/arch/x86/include/asm/module.h @@ -2,6 +2,15 @@ #define _ASM_X86_MODULE_H #include <asm-generic/module.h> +#include <asm/undwarf.h> + +struct mod_arch_specific { +#ifdef CONFIG_UNDWARF_UNWINDER + unsigned int num_undwarves; + int *undwarf_ip; + struct undwarf *undwarf; +#endif +}; #ifdef CONFIG_X86_64 /* X86_64 does not define MODULE_PROC_FAMILY */ diff --git a/arch/x86/include/asm/unwind.h b/arch/x86/include/asm/unwind.h index e667649..1f8cb78 100644 --- a/arch/x86/include/asm/unwind.h +++ b/arch/x86/include/asm/unwind.h @@ -12,11 +12,14 @@ struct unwind_state { struct task_struct *task; int graph_idx; bool error; -#ifdef CONFIG_FRAME_POINTER +#if defined(CONFIG_UNDWARF_UNWINDER) + bool signal, full_regs; + unsigned long sp, bp, ip; + struct pt_regs *regs; +#elif defined(CONFIG_FRAME_POINTER) bool got_irq; - unsigned long *bp, *orig_sp; + unsigned 
long *bp, *orig_sp, ip; struct pt_regs *regs; - unsigned long ip; #else unsigned long *sp; #endif @@ -24,41 +27,30 @@ struct unwind_state { void __unwind_start(struct unwind_state *state, struct task_struct *task, struct pt_regs *regs, unsigned long *first_frame); - bool unwind_next_frame(struct unwind_state *state); - unsigned long unwind_get_return_address(struct unwind_state *state); +unsigned long *unwind_get_return_address_ptr(struct unwind_state *state); static inline bool unwind_done(struct unwind_state *state) { return state->stack_info.type == STACK_TYPE_UNKNOWN; } -static inline -void unwind_start(struct unwind_state *state, struct task_struct *task, - struct pt_regs *regs, unsigned long *first_frame) -{ - first_frame = first_frame ? : get_stack_pointer(task, regs); - - __unwind_start(state, task, regs, first_frame); -} - static inline bool unwind_error(struct unwind_state *state) { return state->error; } -#ifdef CONFIG_FRAME_POINTER - static inline -unsigned long *unwind_get_return_address_ptr(struct unwind_state *state) +void unwind_start(struct unwind_state *state, struct task_struct *task, + struct pt_regs *regs, unsigned long *first_frame) { - if (unwind_done(state)) - return NULL; + first_frame = first_frame ? : get_stack_pointer(task, regs); - return state->regs ? 
&state->regs->ip : state->bp + 1; + __unwind_start(state, task, regs, first_frame); } +#if defined(CONFIG_UNDWARF_UNWINDER) || defined(CONFIG_FRAME_POINTER) static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state) { if (unwind_done(state)) @@ -66,20 +58,47 @@ static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state) return state->regs; } - -#else /* !CONFIG_FRAME_POINTER */ - -static inline -unsigned long *unwind_get_return_address_ptr(struct unwind_state *state) +#else +static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state) { return NULL; } +#endif -static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state) +#ifdef CONFIG_UNDWARF_UNWINDER +void unwind_init(void); +void unwind_module_init(struct module *mod, void *undwarf_ip, + size_t undwarf_ip_size, void *undwarf, + size_t undwarf_size); +#else +static inline void unwind_init(void) {} +static inline void unwind_module_init(struct module *mod, void *undwarf_ip, + size_t undwarf_ip_size, void *undwarf, + size_t undwarf_size) {} +#endif + +/* + * This disables KASAN checking when reading a value from another task's stack, + * since the other task could be running on another CPU and could have poisoned + * the stack in the meantime. 
+ */ +#define READ_ONCE_TASK_STACK(task, x) \ +({ \ + unsigned long val; \ + if (task == current) \ + val = READ_ONCE(x); \ + else \ + val = READ_ONCE_NOCHECK(x); \ + val; \ +}) + +static inline bool task_on_another_cpu(struct task_struct *task) { - return NULL; +#ifdef CONFIG_SMP + return task != current && task->on_cpu; +#else + return false; +#endif } -#endif /* CONFIG_FRAME_POINTER */ - #endif /* _ASM_X86_UNWIND_H */ diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index 3c7c419..4865889 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -125,11 +125,9 @@ obj-$(CONFIG_PERF_EVENTS) += perf_regs.o obj-$(CONFIG_TRACING) += tracepoint.o obj-$(CONFIG_SCHED_MC_PRIO) += itmt.o -ifdef CONFIG_FRAME_POINTER -obj-y += unwind_frame.o -else -obj-y += unwind_guess.o -endif +obj-$(CONFIG_UNDWARF_UNWINDER) += unwind_undwarf.o +obj-$(CONFIG_FRAME_POINTER_UNWINDER) += unwind_frame.o +obj-$(CONFIG_GUESS_UNWINDER) += unwind_guess.o ### # 64 bit specific files diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c index f67bd32..203b5a7 100644 --- a/arch/x86/kernel/module.c +++ b/arch/x86/kernel/module.c @@ -35,6 +35,7 @@ #include <asm/page.h> #include <asm/pgtable.h> #include <asm/setup.h> +#include <asm/unwind.h> #if 0 #define DEBUGP(fmt, ...) 
\ @@ -213,7 +214,7 @@ int module_finalize(const Elf_Ehdr *hdr, struct module *me) { const Elf_Shdr *s, *text = NULL, *alt = NULL, *locks = NULL, - *para = NULL; + *para = NULL, *undwarf = NULL, *undwarf_ip = NULL; char *secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset; for (s = sechdrs; s < sechdrs + hdr->e_shnum; s++) { @@ -225,6 +226,10 @@ int module_finalize(const Elf_Ehdr *hdr, locks = s; if (!strcmp(".parainstructions", secstrings + s->sh_name)) para = s; + if (!strcmp(".undwarf", secstrings + s->sh_name)) + undwarf = s; + if (!strcmp(".undwarf_ip", secstrings + s->sh_name)) + undwarf_ip = s; } if (alt) { @@ -248,6 +253,11 @@ int module_finalize(const Elf_Ehdr *hdr, /* make jump label nops */ jump_label_apply_nops(me); + if (undwarf && undwarf_ip) + unwind_module_init(me, (void *)undwarf_ip->sh_addr, + undwarf_ip->sh_size, + (void *)undwarf->sh_addr, undwarf->sh_size); + return 0; } diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index 65622f0..d736761 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -115,6 +115,7 @@ #include <asm/microcode.h> #include <asm/mmu_context.h> #include <asm/kaslr.h> +#include <asm/unwind.h> /* * max_low_pfn_mapped: highest direct mapped pfn under 4GB @@ -1303,6 +1304,8 @@ void __init setup_arch(char **cmdline_p) if (efi_enabled(EFI_BOOT)) efi_apply_memmap_quirks(); #endif + + unwind_init(); } #ifdef CONFIG_X86_32 diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c index b9389d7..7574ef5 100644 --- a/arch/x86/kernel/unwind_frame.c +++ b/arch/x86/kernel/unwind_frame.c @@ -10,20 +10,22 @@ #define FRAME_HEADER_SIZE (sizeof(long) * 2) -/* - * This disables KASAN checking when reading a value from another task's stack, - * since the other task could be running on another CPU and could have poisoned - * the stack in the meantime. 
- */ -#define READ_ONCE_TASK_STACK(task, x) \ -({ \ - unsigned long val; \ - if (task == current) \ - val = READ_ONCE(x); \ - else \ - val = READ_ONCE_NOCHECK(x); \ - val; \ -}) +unsigned long unwind_get_return_address(struct unwind_state *state) +{ + if (unwind_done(state)) + return 0; + + return __kernel_text_address(state->ip) ? state->ip : 0; +} +EXPORT_SYMBOL_GPL(unwind_get_return_address); + +unsigned long *unwind_get_return_address_ptr(struct unwind_state *state) +{ + if (unwind_done(state)) + return NULL; + + return state->regs ? &state->regs->ip : state->bp + 1; +} static void unwind_dump(struct unwind_state *state) { @@ -66,15 +68,6 @@ static void unwind_dump(struct unwind_state *state) } } -unsigned long unwind_get_return_address(struct unwind_state *state) -{ - if (unwind_done(state)) - return 0; - - return __kernel_text_address(state->ip) ? state->ip : 0; -} -EXPORT_SYMBOL_GPL(unwind_get_return_address); - static size_t regs_size(struct pt_regs *regs) { /* x86_32 regs from kernel mode are two words shorter: */ diff --git a/arch/x86/kernel/unwind_guess.c b/arch/x86/kernel/unwind_guess.c index 039f367..4f0e17b 100644 --- a/arch/x86/kernel/unwind_guess.c +++ b/arch/x86/kernel/unwind_guess.c @@ -19,6 +19,11 @@ unsigned long unwind_get_return_address(struct unwind_state *state) } EXPORT_SYMBOL_GPL(unwind_get_return_address); +unsigned long *unwind_get_return_address_ptr(struct unwind_state *state) +{ + return NULL; +} + bool unwind_next_frame(struct unwind_state *state) { struct stack_info *info = &state->stack_info; diff --git a/arch/x86/kernel/unwind_undwarf.c b/arch/x86/kernel/unwind_undwarf.c new file mode 100644 index 0000000..44f62af --- /dev/null +++ b/arch/x86/kernel/unwind_undwarf.c @@ -0,0 +1,589 @@ +#include <linux/module.h> +#include <linux/sort.h> +#include <asm/ptrace.h> +#include <asm/stacktrace.h> +#include <asm/unwind.h> +#include <asm/undwarf.h> +#include <asm/sections.h> + +#define undwarf_warn(fmt, ...) 
\ + printk_deferred_once(KERN_WARNING pr_fmt("WARNING: " fmt), ##__VA_ARGS__) + +extern int __start_undwarf_ip[]; +extern int __stop_undwarf_ip[]; +extern struct undwarf __start_undwarf[]; +extern struct undwarf __stop_undwarf[]; + +bool undwarf_init; +static DEFINE_MUTEX(sort_mutex); + +int *cur_undwarf_ip_table = __start_undwarf_ip; +struct undwarf *cur_undwarf_table = __start_undwarf; + +/* + * This is a lookup table for speeding up access to the undwarf table. Given + * an input address offset, the corresponding lookup table entry specifies a + * subset of the undwarf table to search. + * + * Each block represents the end of the previous range and the start of the + * next range. An extra block is added to give the last range an end. + * + * Some measured performance results for different values of LOOKUP_NUM_BLOCKS: + * + * num blocks array size lookup speedup total speedup + * 2k 8k 1.5x 1.5x + * 4k 16k 1.6x 1.6x + * 8k 32k 1.8x 1.7x + * 16k 64k 2.0x 1.8x + * 32k 128k 2.5x 2.0x + * 64k 256k 2.9x 2.2x + * 128k 512k 3.3x 2.4x + * + * Go with 32k blocks because it doubles unwinder performance while only adding + * 3.5% to the undwarf data footprint. + */ +#define LOOKUP_NUM_BLOCKS (32 * 1024) +static unsigned int undwarf_fast_lookup[LOOKUP_NUM_BLOCKS + 1] __ro_after_init; + +#define LOOKUP_START_IP (unsigned long)_stext +#define LOOKUP_STOP_IP (unsigned long)_etext +#define LOOKUP_BLOCK_SIZE \ + (DIV_ROUND_UP(LOOKUP_STOP_IP - LOOKUP_START_IP, \ + LOOKUP_NUM_BLOCKS)) + + +static inline unsigned long undwarf_ip(const int *ip) +{ + return (unsigned long)ip + *ip; +} + +static struct undwarf *__undwarf_find(int *ip_table, struct undwarf *u_table, + unsigned int num_entries, + unsigned long ip) +{ + int *first = ip_table; + int *last = ip_table + num_entries - 1; + int *mid = first, *found = first; + + if (!num_entries) + return NULL; + + /* + * Do a binary range search to find the rightmost duplicate of a given + * starting address. 
Some entries are section terminators which are + * "weak" entries for ensuring there are no gaps. They should be + * ignored when they conflict with a real entry. + */ + while (first <= last) { + mid = first + ((last - first) / 2); + + if (undwarf_ip(mid) <= ip) { + found = mid; + first = mid + 1; + } else + last = mid - 1; + } + + return u_table + (found - ip_table); +} + +static struct undwarf *undwarf_find(unsigned long ip) +{ + struct module *mod; + + if (!undwarf_init) + return NULL; + + /* For non-init vmlinux addresses, use the fast lookup table: */ + if (ip >= LOOKUP_START_IP && ip < LOOKUP_STOP_IP) { + unsigned int idx, start, stop; + + idx = (ip - LOOKUP_START_IP) / LOOKUP_BLOCK_SIZE; + + if (WARN_ON_ONCE(idx >= LOOKUP_NUM_BLOCKS)) + return NULL; + + start = undwarf_fast_lookup[idx]; + stop = undwarf_fast_lookup[idx + 1] + 1; + + if (WARN_ON_ONCE(__start_undwarf + start >= __stop_undwarf) || + __start_undwarf + stop > __stop_undwarf) + return NULL; + + return __undwarf_find(__start_undwarf_ip + start, + __start_undwarf + start, + stop - start, ip); + } + + /* vmlinux .init slow lookup: */ + if (ip >= (unsigned long)_sinittext && ip < (unsigned long)_einittext) + return __undwarf_find(__start_undwarf_ip, __start_undwarf, + __stop_undwarf - __start_undwarf, ip); + + /* Module lookup: */ + mod = __module_address(ip); + if (!mod || !mod->arch.undwarf || !mod->arch.undwarf_ip) + return NULL; + return __undwarf_find(mod->arch.undwarf_ip, mod->arch.undwarf, + mod->arch.num_undwarves, ip); +} + +static void undwarf_sort_swap(void *_a, void *_b, int size) +{ + struct undwarf *undwarf_a, *undwarf_b; + struct undwarf undwarf_tmp; + int *a = _a, *b = _b, tmp; + int delta = _b - _a; + + /* Swap the undwarf_ip entries: */ + tmp = *a; + *a = *b + delta; + *b = tmp - delta; + + /* Swap the corresponding undwarf entries: */ + undwarf_a = cur_undwarf_table + (a - cur_undwarf_ip_table); + undwarf_b = cur_undwarf_table + (b - cur_undwarf_ip_table); + undwarf_tmp = 
*undwarf_a; + *undwarf_a = *undwarf_b; + *undwarf_b = undwarf_tmp; +} + +static int undwarf_sort_cmp(const void *_a, const void *_b) +{ + struct undwarf *undwarf_a; + const int *a = _a, *b = _b; + unsigned long a_val = undwarf_ip(a); + unsigned long b_val = undwarf_ip(b); + + if (a_val > b_val) + return 1; + if (a_val < b_val) + return -1; + + /* + * The "weak" section terminator entries need to always be on the left + * to ensure the lookup code skips them in favor of real entries. + * These terminator entries exist to handle any gaps created by + * whitelisted .o files which didn't get objtool generation. + */ + undwarf_a = cur_undwarf_table + (a - cur_undwarf_ip_table); + return undwarf_a->cfa_reg == UNDWARF_REG_UNDEFINED ? -1 : 1; +} + +void unwind_module_init(struct module *mod, void *_undwarf_ip, + size_t undwarf_ip_size, void *_undwarf, + size_t undwarf_size) +{ + int *undwarf_ip = _undwarf_ip; + struct undwarf *undwarf = _undwarf; + unsigned int num_entries = undwarf_ip_size / sizeof(int); + + WARN_ON_ONCE(undwarf_ip_size % sizeof(int) != 0 || + undwarf_size % sizeof(*undwarf) != 0 || + num_entries != undwarf_size / sizeof(*undwarf)); + + /* + * The 'cur_undwarf_*' globals allow the undwarf_sort_swap() callback + * to associate an undwarf_ip table entry with its corresponding + * undwarf entry so they can both be swapped. 
+ */ + mutex_lock(&sort_mutex); + cur_undwarf_ip_table = undwarf_ip; + cur_undwarf_table = undwarf; + sort(undwarf_ip, num_entries, sizeof(int), undwarf_sort_cmp, + undwarf_sort_swap); + mutex_unlock(&sort_mutex); + + mod->arch.undwarf_ip = undwarf_ip; + mod->arch.undwarf = undwarf; + mod->arch.num_undwarves = num_entries; +} + +void __init unwind_init(void) +{ + size_t undwarf_ip_size = (void *)__stop_undwarf_ip - (void *)__start_undwarf_ip; + size_t undwarf_size = (void *)__stop_undwarf - (void *)__start_undwarf; + size_t num_entries = undwarf_ip_size / sizeof(int); + struct undwarf *undwarf; + int i; + + if (!num_entries || undwarf_ip_size % sizeof(int) != 0 || + undwarf_size % sizeof(struct undwarf) != 0 || + num_entries != undwarf_size / sizeof(struct undwarf)) { + pr_warn("WARNING: Bad or missing undwarf table. Disabling unwinder.\n"); + return; + } + + /* Sort the undwarf table: */ + sort(__start_undwarf_ip, num_entries, sizeof(int), undwarf_sort_cmp, + undwarf_sort_swap); + + /* Initialize the fast lookup table: */ + for (i = 0; i < LOOKUP_NUM_BLOCKS; i++) { + undwarf = __undwarf_find(__start_undwarf_ip, __start_undwarf, + num_entries, + LOOKUP_START_IP + (LOOKUP_BLOCK_SIZE * i)); + if (!undwarf) { + pr_warn("WARNING: Corrupt undwarf table. Disabling unwinder.\n"); + return; + } + + undwarf_fast_lookup[i] = undwarf - __start_undwarf; + } + + /* Initialize the last 'end' block: */ + undwarf = __undwarf_find(__start_undwarf_ip, __start_undwarf, + num_entries, LOOKUP_STOP_IP); + if (!undwarf) { + pr_warn("WARNING: Corrupt undwarf table. Disabling unwinder.\n"); + return; + } + undwarf_fast_lookup[LOOKUP_NUM_BLOCKS] = undwarf - __start_undwarf; + + undwarf_init = true; +} + +unsigned long unwind_get_return_address(struct unwind_state *state) +{ + if (unwind_done(state)) + return 0; + + return __kernel_text_address(state->ip) ? 
state->ip : 0; +} +EXPORT_SYMBOL_GPL(unwind_get_return_address); + +unsigned long *unwind_get_return_address_ptr(struct unwind_state *state) +{ + if (unwind_done(state)) + return NULL; + + if (state->regs) + return &state->regs->ip; + + if (state->sp) + return (unsigned long *)state->sp - 1; + + return NULL; +} + +static bool stack_access_ok(struct unwind_state *state, unsigned long addr, + size_t len) +{ + struct stack_info *info = &state->stack_info; + + /* + * If the address isn't on the current stack, switch to the next one. + * + * We may have to traverse multiple stacks to deal with the possibility + * that info->next_sp could point to an empty stack and the address + * could be on a subsequent stack. + */ + while (!on_stack(info, (void *)addr, len)) + if (get_stack_info(info->next_sp, state->task, info, + &state->stack_mask)) + return false; + + return true; +} + +static bool deref_stack_reg(struct unwind_state *state, unsigned long addr, + unsigned long *val) +{ + if (!stack_access_ok(state, addr, sizeof(long))) + return false; + + *val = READ_ONCE_TASK_STACK(state->task, *(unsigned long *)addr); + return true; +} + +#define REGS_SIZE (sizeof(struct pt_regs)) +#define SP_OFFSET (offsetof(struct pt_regs, sp)) +#define IRET_REGS_SIZE (REGS_SIZE - offsetof(struct pt_regs, ip)) +#define IRET_SP_OFFSET (SP_OFFSET - offsetof(struct pt_regs, ip)) + +static bool deref_stack_regs(struct unwind_state *state, unsigned long addr, + unsigned long *ip, unsigned long *sp, bool full) +{ + size_t regs_size = full ? REGS_SIZE : IRET_REGS_SIZE; + size_t sp_offset = full ? 
SP_OFFSET : IRET_SP_OFFSET; + struct pt_regs *regs = (struct pt_regs *)(addr + regs_size - REGS_SIZE); + + if (IS_ENABLED(CONFIG_X86_64)) { + if (!stack_access_ok(state, addr, regs_size)) + return false; + + *ip = regs->ip; + *sp = regs->sp; + + return true; + } + + if (!stack_access_ok(state, addr, sp_offset)) + return false; + + *ip = regs->ip; + + if (user_mode(regs)) { + if (!stack_access_ok(state, addr + sp_offset, + REGS_SIZE - SP_OFFSET)) + return false; + + *sp = regs->sp; + } else + *sp = (unsigned long)&regs->sp; + + return true; +} + +bool unwind_next_frame(struct unwind_state *state) +{ + enum stack_type prev_type = state->stack_info.type; + unsigned long ip_p, prev_sp = state->sp; + unsigned long cfa, orig_ip, orig_sp; + struct undwarf *undwarf; + struct pt_regs *ptregs; + bool indirect = false; + + if (unwind_done(state)) + return false; + + /* Don't let modules unload while we're reading their undwarf data. */ + preempt_disable(); + + /* Have we reached the end? */ + if (state->regs && user_mode(state->regs)) + goto done; + + /* + * Find the undwarf table entry associated with the text address. + * + * Decrement call return addresses by one so they work for sibling + * calls and calls to noreturn functions. + */ + undwarf = undwarf_find(state->signal ? 
state->ip : state->ip - 1); + if (!undwarf || undwarf->cfa_reg == UNDWARF_REG_UNDEFINED) + goto done; + orig_ip = state->ip; + + /* Calculate the CFA (caller frame address): */ + switch (undwarf->cfa_reg) { + case UNDWARF_REG_SP: + cfa = state->sp + undwarf->cfa_offset; + break; + + case UNDWARF_REG_BP: + cfa = state->bp + undwarf->cfa_offset; + break; + + case UNDWARF_REG_SP_INDIRECT: + cfa = state->sp + undwarf->cfa_offset; + indirect = true; + break; + + case UNDWARF_REG_BP_INDIRECT: + cfa = state->bp + undwarf->cfa_offset; + indirect = true; + break; + + case UNDWARF_REG_R10: + if (!state->regs || !state->full_regs) { + undwarf_warn("missing regs for base reg R10 at ip %p\n", + (void *)state->ip); + goto done; + } + cfa = state->regs->r10; + break; + + case UNDWARF_REG_R13: + if (!state->regs || !state->full_regs) { + undwarf_warn("missing regs for base reg R13 at ip %p\n", + (void *)state->ip); + goto done; + } + cfa = state->regs->r13; + break; + + case UNDWARF_REG_DI: + if (!state->regs || !state->full_regs) { + undwarf_warn("missing regs for base reg DI at ip %p\n", + (void *)state->ip); + goto done; + } + cfa = state->regs->di; + break; + + case UNDWARF_REG_DX: + if (!state->regs || !state->full_regs) { + undwarf_warn("missing regs for base reg DX at ip %p\n", + (void *)state->ip); + goto done; + } + cfa = state->regs->dx; + break; + + default: + undwarf_warn("unknown CFA base reg %d for ip %p\n", + undwarf->cfa_reg, (void *)state->ip); + goto done; + } + + if (indirect) { + if (!deref_stack_reg(state, cfa, &cfa)) + goto done; + } + + /* Find IP, SP and possibly regs: */ + switch (undwarf->type) { + case UNDWARF_TYPE_CFA: + ip_p = cfa - sizeof(long); + + if (!deref_stack_reg(state, ip_p, &state->ip)) + goto done; + + state->ip = ftrace_graph_ret_addr(state->task, &state->graph_idx, + state->ip, (void *)ip_p); + + state->sp = cfa; + state->regs = NULL; + state->signal = false; + break; + + case UNDWARF_TYPE_REGS: + if (!deref_stack_regs(state, cfa, 
&state->ip, &state->sp, true)) { + undwarf_warn("can't dereference registers at %p for ip %p\n", + (void *)cfa, (void *)orig_ip); + goto done; + } + + state->regs = (struct pt_regs *)cfa; + state->full_regs = true; + state->signal = true; + break; + + case UNDWARF_TYPE_REGS_IRET: + orig_sp = state->sp; + if (!deref_stack_regs(state, cfa, &state->ip, &state->sp, false)) { + undwarf_warn("can't dereference iret registers at %p for ip %p\n", + (void *)cfa, (void *)orig_ip); + goto done; + } + + ptregs = container_of((void *)cfa, struct pt_regs, ip); + if ((unsigned long)ptregs >= orig_sp && + on_stack(&state->stack_info, ptregs, REGS_SIZE)) { + state->regs = ptregs; + state->full_regs = false; + } else + state->regs = NULL; + + state->signal = true; + break; + + default: + undwarf_warn("unknown undwarf type %d\n", undwarf->type); + break; + } + + /* Find BP: */ + switch (undwarf->bp_reg) { + case UNDWARF_REG_UNDEFINED: + if (state->regs && state->full_regs) + state->bp = state->regs->bp; + break; + + case UNDWARF_REG_CFA: + if (!deref_stack_reg(state, cfa + undwarf->bp_offset, &state->bp)) + goto done; + break; + + case UNDWARF_REG_BP: + if (!deref_stack_reg(state, state->bp + undwarf->bp_offset, &state->bp)) + goto done; + break; + + default: + undwarf_warn("unknown BP base reg %d for ip %p\n", + undwarf->bp_reg, (void *)orig_ip); + goto done; + } + + /* Prevent a recursive loop due to bad undwarf data: */ + if (state->stack_info.type == prev_type && + on_stack(&state->stack_info, (void *)state->sp, sizeof(long)) && + state->sp <= prev_sp) { + undwarf_warn("stack going in the wrong direction? 
ip=%p\n", + (void *)orig_ip); + goto done; + } + + preempt_enable(); + return true; + +done: + preempt_enable(); + state->stack_info.type = STACK_TYPE_UNKNOWN; + return false; +} +EXPORT_SYMBOL_GPL(unwind_next_frame); + +void __unwind_start(struct unwind_state *state, struct task_struct *task, + struct pt_regs *regs, unsigned long *first_frame) +{ + memset(state, 0, sizeof(*state)); + state->task = task; + + /* + * Refuse to unwind the stack of a task while it's executing on another + * CPU. This check is racy, but that's ok: the unwinder has other + * checks to prevent it from going off the rails. + */ + if (task_on_another_cpu(task)) + goto done; + + if (regs) { + if (user_mode(regs)) + goto done; + + state->ip = regs->ip; + state->sp = kernel_stack_pointer(regs); + state->bp = regs->bp; + state->regs = regs; + state->full_regs = true; + state->signal = true; + + } else if (task == current) { + asm volatile("lea (%%rip), %0\n\t" + "mov %%rsp, %1\n\t" + "mov %%rbp, %2\n\t" + : "=r" (state->ip), "=r" (state->sp), + "=r" (state->bp)); + + } else { + struct inactive_task_frame *frame = (void *)task->thread.sp; + + state->ip = frame->ret_addr; + state->sp = task->thread.sp; + state->bp = frame->bp; + } + + if (get_stack_info((unsigned long *)state->sp, state->task, + &state->stack_info, &state->stack_mask)) + return; + + /* + * The caller can provide the address of the first frame directly + * (first_frame) or indirectly (regs->sp) to indicate which stack frame + * to start unwinding at. Skip ahead until we reach it. 
+	 */
+	while (!unwind_done(state) &&
+	       (!on_stack(&state->stack_info, first_frame, sizeof(long)) ||
+			state->sp <= (unsigned long)first_frame))
+		unwind_next_frame(state);
+
+	return;
+
+done:
+	state->stack_info.type = STACK_TYPE_UNKNOWN;
+	return;
+}
+EXPORT_SYMBOL_GPL(__unwind_start);
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index c8a3b61..e3b7cfc 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -148,6 +148,8 @@ SECTIONS
 
 	BUG_TABLE
 
+	UNDWARF_TABLE
+
 	. = ALIGN(PAGE_SIZE);
 	__vvar_page = .;
 
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 0d64658..a8ed616 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -668,6 +668,24 @@
 #define BUG_TABLE
 #endif
 
+#ifdef CONFIG_UNDWARF_UNWINDER
+#define UNDWARF_TABLE							\
+	. = ALIGN(4);							\
+	.undwarf_ip : AT(ADDR(.undwarf_ip) - LOAD_OFFSET) {		\
+		VMLINUX_SYMBOL(__start_undwarf_ip) = .;			\
+		KEEP(*(.undwarf_ip))					\
+		VMLINUX_SYMBOL(__stop_undwarf_ip) = .;			\
+	}								\
+	. = ALIGN(8);							\
+	.undwarf : AT(ADDR(.undwarf) - LOAD_OFFSET) {			\
+		VMLINUX_SYMBOL(__start_undwarf) = .;			\
+		KEEP(*(.undwarf))					\
+		VMLINUX_SYMBOL(__stop_undwarf) = .;			\
+	}
+#else
+#define UNDWARF_TABLE
+#endif
+
 #ifdef CONFIG_PM_TRACE
 #define TRACEDATA							\
 	. = ALIGN(4);							\
@@ -854,7 +872,7 @@
 		DATA_DATA						\
 		CONSTRUCTORS						\
 	}								\
-	BUG_TABLE
+	BUG_TABLE							\
 
 #define INIT_TEXT_SECTION(inittext_align)				\
 	. = ALIGN(inittext_align);					\
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 9c5d40a..ec79366 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -374,6 +374,9 @@ config STACK_VALIDATION
 	  pointers (if CONFIG_FRAME_POINTER is enabled). This helps ensure
 	  that runtime stack traces are more reliable.
 
+	  This is also a prerequisite for creation of the undwarf format which
+	  is needed for CONFIG_UNDWARF_UNWINDER.
+
 	  For more information, see
 	  tools/objtool/Documentation/stack-validation.txt.
 
diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index 733e044..7859c79 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -258,7 +258,8 @@ ifneq ($(SKIP_STACK_VALIDATION),1)
 
 __objtool_obj := $(objtree)/tools/objtool/objtool
 
-objtool_args = check
+objtool_args = $(if $(CONFIG_UNDWARF_UNWINDER),undwarf generate,check)
+
 ifndef CONFIG_FRAME_POINTER
 objtool_args += --no-fp
 endif
@@ -276,6 +277,11 @@ objtool_obj = $(if $(patsubst y%,, \
 endif # SKIP_STACK_VALIDATION
 endif # CONFIG_STACK_VALIDATION
 
+# Rebuild all objects when objtool changes, or is enabled/disabled.
+objtool_dep = $(objtool_obj)					\
+	      $(wildcard include/config/undwarf/unwinder.h	\
+			 include/config/stack/validation.h)
+
 define rule_cc_o_c
 	$(call echo-cmd,checksrc) $(cmd_checksrc)			  \
 	$(call cmd_and_fixdep,cc_o_c)					  \
@@ -298,13 +304,13 @@ cmd_undef_syms = echo
 endif
 
 # Built-in and composite module parts
-$(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_obj) FORCE
+$(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_dep) FORCE
 	$(call cmd,force_checksrc)
 	$(call if_changed_rule,cc_o_c)
 
 # Single-part modules are special since we need to mark them in $(MODVERDIR)
 
-$(single-used-m): $(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_obj) FORCE
+$(single-used-m): $(obj)/%.o: $(src)/%.c $(recordmcount_source) $(objtool_dep) FORCE
 	$(call cmd,force_checksrc)
 	$(call if_changed_rule,cc_o_c)
 	@{ echo $(@:.o=.ko); echo $@; \
@@ -399,7 +405,7 @@ cmd_modversions_S =								\
 endif
 endif
 
-$(obj)/%.o: $(src)/%.S $(objtool_obj) FORCE
+$(obj)/%.o: $(src)/%.S $(objtool_dep) FORCE
 	$(call if_changed_rule,as_o_S)
 
 targets += $(real-objs-y) $(real-objs-m) $(lib-y)
-- 
2.7.5
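
Illustrative note (not part of the patch): the UNDWARF_TYPE_CFA step in unwind_next_frame() reduces to three moves: compute the CFA from a base register plus an offset, read the return address from the word just below the CFA, and refuse to continue if the stack pointer would not move toward the caller. The user-space sketch below shows only that arithmetic. Every name prefixed with sketch_ is hypothetical and does not exist in the kernel; the bounds check is a loose stand-in for deref_stack_reg(), and the ftrace, pt_regs and BP handling are omitted. Unlike the patch, the sketch runs the direction check before committing the new state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the patch's types. */
enum sketch_reg { SKETCH_REG_SP, SKETCH_REG_BP };

struct sketch_entry {
	enum sketch_reg cfa_reg;	/* base register for the CFA */
	long cfa_offset;		/* byte offset added to that register */
};

struct sketch_state {
	uintptr_t ip, sp, bp;
	uintptr_t stack_lo, stack_hi;	/* valid stack range [lo, hi) */
};

/* Bounds-checked read of one word (assumed analogue of deref_stack_reg()). */
static bool sketch_deref(const struct sketch_state *s, uintptr_t addr,
			 uintptr_t *val)
{
	if (addr < s->stack_lo || addr + sizeof(*val) > s->stack_hi)
		return false;
	memcpy(val, (const void *)addr, sizeof(*val));
	return true;
}

/* One unwind step for the simple "return address below the CFA" case. */
static bool sketch_next_frame(struct sketch_state *s,
			      const struct sketch_entry *e)
{
	uintptr_t prev_sp = s->sp, cfa, ip;

	switch (e->cfa_reg) {
	case SKETCH_REG_SP:
		cfa = s->sp + (uintptr_t)e->cfa_offset;
		break;
	case SKETCH_REG_BP:
		cfa = s->bp + (uintptr_t)e->cfa_offset;
		break;
	default:
		return false;
	}

	/* The return address sits in the word just below the CFA. */
	if (!sketch_deref(s, cfa - sizeof(uintptr_t), &ip))
		return false;

	/* Bad data must not send the unwinder backwards or in a loop. */
	if (cfa <= prev_sp)
		return false;

	s->ip = ip;
	s->sp = cfa;
	return true;
}

/* Builds a fake 8-word stack and unwinds one frame; returns 0 on success. */
int sketch_demo(void)
{
	uintptr_t stack[8] = { 0 };
	struct sketch_entry e = { SKETCH_REG_SP, 3 * (long)sizeof(uintptr_t) };
	struct sketch_state s = { 0 };

	stack[2] = 0x1234;		/* pretend return address */
	s.ip = 0x1000;
	s.sp = (uintptr_t)&stack[0];
	s.stack_lo = (uintptr_t)&stack[0];
	s.stack_hi = (uintptr_t)&stack[8];

	if (!sketch_next_frame(&s, &e))
		return -1;

	assert(s.ip == 0x1234);
	assert(s.sp == (uintptr_t)&stack[3]);
	return 0;
}
```

Calling sketch_demo() from a test harness exercises one step: with sp at stack[0] and a CFA offset of three words, the CFA lands on stack[3] and the return address is read from stack[2].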