On Tue, May 3, 2016 at 12:31 PM, Thomas Garnier <thgarnie@xxxxxxxxxx> wrote: > Randomizes the virtual address space of kernel memory sections (physical > memory mapping, vmalloc & vmemmap) for x86_64. This security feature > mitigates exploits relying on predictable kernel addresses. These > addresses can be used to disclose the kernel modules base addresses or > corrupt specific structures to elevate privileges bypassing the current > implementation of KASLR. This feature can be enabled with the > CONFIG_RANDOMIZE_MEMORY option. I'm struggling to come up with a more accurate name for this, since it's a base randomization of the kernel memory sections. Everything else seems needlessly long (CONFIG_RANDOMIZE_BASE_MEMORY). I wonder if things should be renamed generally to CONFIG_KASLR_BASE, CONFIG_KASLR_MEMORY, etc, but that doesn't need to be part of this series. Let's leave this as-is, and just make sure it's clear in the Kconfig. > The physical memory mapping holds most allocations from boot and heap > allocators. Knowning the base address and physical memory size, an > attacker can deduce the PDE virtual address for the vDSO memory page. > This attack was demonstrated at CanSecWest 2016, in the "Getting Physical Extreme Abuse of Intel Based Paged Systems" > https://goo.gl/ANpWdV (see second part of the presentation). The > exploits used against Linux worked successfuly against 4.6+ but fail > with KASLR memory enabled (https://goo.gl/iTtXMJ). Similar research > was done at Google leading to this patch proposal. Variants exists to > overwrite /proc or /sys objects ACLs leading to elevation of privileges. > These variants were testeda against 4.6+. Typo "tested". > > The vmalloc memory section contains the allocation made through the > vmalloc api. The allocations are done sequentially to prevent > fragmentation and each allocation address can easily be deduced > especially from boot. > > The vmemmap section holds a representation of the physical > memory (through a struct page array). An attacker could use this section > to disclose the kernel memory layout (walking the page linked list). > > The order of each memory section is not changed. The feature looks at > the available space for the sections based on different configuration > options and randomizes the base and space between each. The size of the > physical memory mapping is the available physical memory. No performance > impact was detected while testing the feature. > > Entropy is generated using the KASLR early boot functions now shared in > the lib directory (originally written by Kees Cook). Randomization is > done on PGD & PUD page table levels to increase possible addresses. The > physical memory mapping code was adapted to support PUD level virtual > addresses. An additional low memory page is used to ensure each CPU can > start with a PGD aligned virtual address (for realmode). > > x86/dump_pagetable was updated to correctly display each section. > > Updated documentation on x86_64 memory layout accordingly. 
> > Performance data: > > Kernbench shows almost no difference (-+ less than 1%): > > Before: > > Average Optimal load -j 12 Run (std deviation): > Elapsed Time 102.63 (1.2695) > User Time 1034.89 (1.18115) > System Time 87.056 (0.456416) > Percent CPU 1092.9 (13.892) > Context Switches 199805 (3455.33) > Sleeps 97907.8 (900.636) > > After: > > Average Optimal load -j 12 Run (std deviation): > Elapsed Time 102.489 (1.10636) > User Time 1034.86 (1.36053) > System Time 87.764 (0.49345) > Percent CPU 1095 (12.7715) > Context Switches 199036 (4298.1) > Sleeps 97681.6 (1031.11) > > Hackbench shows 0% difference on average (hackbench 90 > repeated 10 times): > > attemp,before,after > 1,0.076,0.069 > 2,0.072,0.069 > 3,0.066,0.066 > 4,0.066,0.068 > 5,0.066,0.067 > 6,0.066,0.069 > 7,0.067,0.066 > 8,0.063,0.067 > 9,0.067,0.065 > 10,0.068,0.071 > average,0.0677,0.0677 > > Signed-off-by: Thomas Garnier <thgarnie@xxxxxxxxxx> > --- > Based on next-20160502 > --- > Documentation/x86/x86_64/mm.txt | 4 + > arch/x86/Kconfig | 15 ++++ > arch/x86/include/asm/kaslr.h | 12 +++ > arch/x86/include/asm/page_64_types.h | 11 ++- > arch/x86/include/asm/pgtable_64.h | 1 + > arch/x86/include/asm/pgtable_64_types.h | 15 +++- > arch/x86/kernel/head_64.S | 2 +- > arch/x86/kernel/setup.c | 3 + > arch/x86/mm/Makefile | 1 + > arch/x86/mm/dump_pagetables.c | 11 ++- > arch/x86/mm/init.c | 4 + > arch/x86/mm/kaslr.c | 136 ++++++++++++++++++++++++++++++++ > arch/x86/realmode/init.c | 4 + > 13 files changed, 211 insertions(+), 8 deletions(-) > create mode 100644 arch/x86/mm/kaslr.c > > diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt > index 5aa7383..602a52d 100644 > --- a/Documentation/x86/x86_64/mm.txt > +++ b/Documentation/x86/x86_64/mm.txt > @@ -39,4 +39,8 @@ memory window (this size is arbitrary, it can be raised later if needed). > The mappings are not part of any other kernel PGD and are only available > during EFI runtime calls. > > +Note that if CONFIG_RANDOMIZE_MEMORY is enabled, the direct mapping of all > +physical memory, vmalloc/ioremap space and virtual memory map are randomized. > +Their order is preserved but their base will be changed early at boot time. Maybe instead of "changed", say "offset"? > + > -Andi Kleen, Jul 2004 > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index 0b128b4..60f33c7 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -1988,6 +1988,21 @@ config PHYSICAL_ALIGN > > Don't change this unless you know what you are doing. > > +config RANDOMIZE_MEMORY > + bool "Randomize the kernel memory sections" > + depends on X86_64 > + depends on RANDOMIZE_BASE Does this actually _depend_ on RANDOMIZE_BASE? It needs the lib/kaslr.c code, but this could operate without the kernel base address having been randomized, correct? > + default n As such, maybe the default should be: default RANDOMIZE_BASE > + ---help--- > + Randomizes the virtual address of memory sections (physical memory How about: Randomizes the base virtual address of kernel memory sections ... > + mapping, vmalloc & vmemmap). This security feature mitigates exploits > + relying on predictable memory locations. And "This security feature makes exploits relying on predictable memory locations less reliable." ? > + > + Base and padding between memory section is randomized. Their order is > + not. Entropy is generated in the same way as RANDOMIZE_BASE. Since base would be mentioned above and padding is separate, I'd change this to: The order of allocations remains unchanged. Entropy is generated ... 
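Putting those suggestions together, the whole entry might end up reading something like this (just a sketch; whether the RANDOMIZE_BASE line stays a hard "depends on" is the open question above):

config RANDOMIZE_MEMORY
	bool "Randomize the kernel memory sections"
	depends on X86_64
	depends on RANDOMIZE_BASE
	default RANDOMIZE_BASE
	---help---
	  Randomizes the base virtual address of kernel memory sections
	  (physical memory mapping, vmalloc & vmemmap). This security feature
	  makes exploits relying on predictable memory locations less
	  reliable.

	  The order of allocations remains unchanged. Entropy is generated in
	  the same way as RANDOMIZE_BASE.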
> + > + If unsure, say N. > + > config HOTPLUG_CPU > bool "Support for hot-pluggable CPUs" > depends on SMP > diff --git a/arch/x86/include/asm/kaslr.h b/arch/x86/include/asm/kaslr.h > index 2ae1429..12c7742 100644 > --- a/arch/x86/include/asm/kaslr.h > +++ b/arch/x86/include/asm/kaslr.h > @@ -3,4 +3,16 @@ > > unsigned long kaslr_get_random_boot_long(void); > > +#ifdef CONFIG_RANDOMIZE_MEMORY > +extern unsigned long page_offset_base; > +extern unsigned long vmalloc_base; > +extern unsigned long vmemmap_base; > + > +void kernel_randomize_memory(void); > +void kaslr_trampoline_init(void); > +#else > +static inline void kernel_randomize_memory(void) { } > +static inline void kaslr_trampoline_init(void) { } > +#endif /* CONFIG_RANDOMIZE_MEMORY */ > + > #endif > diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h > index d5c2f8b..9215e05 100644 > --- a/arch/x86/include/asm/page_64_types.h > +++ b/arch/x86/include/asm/page_64_types.h > @@ -1,6 +1,10 @@ > #ifndef _ASM_X86_PAGE_64_DEFS_H > #define _ASM_X86_PAGE_64_DEFS_H > > +#ifndef __ASSEMBLY__ > +#include <asm/kaslr.h> > +#endif > + > #ifdef CONFIG_KASAN > #define KASAN_STACK_ORDER 1 > #else > @@ -32,7 +36,12 @@ > * hypervisor to fit. Choosing 16 slots here is arbitrary, but it's > * what Xen requires. > */ > -#define __PAGE_OFFSET _AC(0xffff880000000000, UL) > +#define __PAGE_OFFSET_BASE _AC(0xffff880000000000, UL) > +#ifdef CONFIG_RANDOMIZE_MEMORY > +#define __PAGE_OFFSET page_offset_base > +#else > +#define __PAGE_OFFSET __PAGE_OFFSET_BASE > +#endif /* CONFIG_RANDOMIZE_MEMORY */ > > #define __START_KERNEL_map _AC(0xffffffff80000000, UL) > > diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h > index 2ee7811..0dfec89 100644 > --- a/arch/x86/include/asm/pgtable_64.h > +++ b/arch/x86/include/asm/pgtable_64.h > @@ -21,6 +21,7 @@ extern pmd_t level2_fixmap_pgt[512]; > extern pmd_t level2_ident_pgt[512]; > extern pte_t level1_fixmap_pgt[512]; > extern pgd_t init_level4_pgt[]; > +extern pgd_t trampoline_pgd_entry; > > #define swapper_pg_dir init_level4_pgt > > diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h > index e6844df..d388739 100644 > --- a/arch/x86/include/asm/pgtable_64_types.h > +++ b/arch/x86/include/asm/pgtable_64_types.h > @@ -5,6 +5,7 @@ > > #ifndef __ASSEMBLY__ > #include <linux/types.h> > +#include <asm/kaslr.h> > > /* > * These are used to make use of C type-checking.. > @@ -54,9 +55,17 @@ typedef struct { pteval_t pte; } pte_t; > > /* See Documentation/x86/x86_64/mm.txt for a description of the memory map. 
*/ > #define MAXMEM _AC(__AC(1, UL) << MAX_PHYSMEM_BITS, UL) > -#define VMALLOC_START _AC(0xffffc90000000000, UL) > -#define VMALLOC_END _AC(0xffffe8ffffffffff, UL) > -#define VMEMMAP_START _AC(0xffffea0000000000, UL) > +#define VMALLOC_SIZE_TB _AC(32, UL) > +#define __VMALLOC_BASE _AC(0xffffc90000000000, UL) > +#define __VMEMMAP_BASE _AC(0xffffea0000000000, UL) > +#ifdef CONFIG_RANDOMIZE_MEMORY > +#define VMALLOC_START vmalloc_base > +#define VMEMMAP_START vmemmap_base > +#else > +#define VMALLOC_START __VMALLOC_BASE > +#define VMEMMAP_START __VMEMMAP_BASE > +#endif /* CONFIG_RANDOMIZE_MEMORY */ > +#define VMALLOC_END (VMALLOC_START + _AC((VMALLOC_SIZE_TB << 40) - 1, UL)) > #define MODULES_VADDR (__START_KERNEL_map + KERNEL_IMAGE_SIZE) > #define MODULES_END _AC(0xffffffffff000000, UL) > #define MODULES_LEN (MODULES_END - MODULES_VADDR) > diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S > index 5df831e..03a2aa0 100644 > --- a/arch/x86/kernel/head_64.S > +++ b/arch/x86/kernel/head_64.S > @@ -38,7 +38,7 @@ > > #define pud_index(x) (((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1)) > > -L4_PAGE_OFFSET = pgd_index(__PAGE_OFFSET) > +L4_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE) > L4_START_KERNEL = pgd_index(__START_KERNEL_map) > L3_START_KERNEL = pud_index(__START_KERNEL_map) > > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c > index c4e7b39..a261658 100644 > --- a/arch/x86/kernel/setup.c > +++ b/arch/x86/kernel/setup.c > @@ -113,6 +113,7 @@ > #include <asm/prom.h> > #include <asm/microcode.h> > #include <asm/mmu_context.h> > +#include <asm/kaslr.h> > > /* > * max_low_pfn_mapped: highest direct mapped pfn under 4GB > @@ -942,6 +943,8 @@ void __init setup_arch(char **cmdline_p) > > x86_init.oem.arch_setup(); > > + kernel_randomize_memory(); > + > iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1; > setup_memory_map(); > parse_setup_data(); > diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile > index 62c0043..96d2b84 100644 > --- a/arch/x86/mm/Makefile > +++ b/arch/x86/mm/Makefile > @@ -37,4 +37,5 @@ obj-$(CONFIG_NUMA_EMU) += numa_emulation.o > > obj-$(CONFIG_X86_INTEL_MPX) += mpx.o > obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o > +obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o > > diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c > index 99bfb19..4a03f60 100644 > --- a/arch/x86/mm/dump_pagetables.c > +++ b/arch/x86/mm/dump_pagetables.c > @@ -72,9 +72,9 @@ static struct addr_marker address_markers[] = { > { 0, "User Space" }, > #ifdef CONFIG_X86_64 > { 0x8000000000000000UL, "Kernel Space" }, > - { PAGE_OFFSET, "Low Kernel Mapping" }, > - { VMALLOC_START, "vmalloc() Area" }, > - { VMEMMAP_START, "Vmemmap" }, > + { 0/* PAGE_OFFSET */, "Low Kernel Mapping" }, > + { 0/* VMALLOC_START */, "vmalloc() Area" }, > + { 0/* VMEMMAP_START */, "Vmemmap" }, > # ifdef CONFIG_X86_ESPFIX64 > { ESPFIX_BASE_ADDR, "ESPfix Area", 16 }, > # endif > @@ -434,6 +434,11 @@ void ptdump_walk_pgd_level_checkwx(void) > > static int __init pt_dump_init(void) > { > +#ifdef CONFIG_X86_64 > + address_markers[LOW_KERNEL_NR].start_address = PAGE_OFFSET; > + address_markers[VMALLOC_START_NR].start_address = VMALLOC_START; > + address_markers[VMEMMAP_START_NR].start_address = VMEMMAP_START; > +#endif > #ifdef CONFIG_X86_32 > /* Not a compile-time constant on x86-32 */ I'd move this comment above your new ifdef and generalize it to something like: /* Various markers are not compile-time constants, so assign them here. 
*/ > address_markers[VMALLOC_START_NR].start_address = VMALLOC_START; > diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c > index 372aad2..e490624 100644 > --- a/arch/x86/mm/init.c > +++ b/arch/x86/mm/init.c > @@ -17,6 +17,7 @@ > #include <asm/proto.h> > #include <asm/dma.h> /* for MAX_DMA_PFN */ > #include <asm/microcode.h> > +#include <asm/kaslr.h> > > /* > * We need to define the tracepoints somewhere, and tlb.c > @@ -590,6 +591,9 @@ void __init init_mem_mapping(void) > /* the ISA range is always mapped regardless of memory holes */ > init_memory_mapping(0, ISA_END_ADDRESS); > > + /* Init the trampoline page table if needed for KASLR memory */ > + kaslr_trampoline_init(); > + > /* > * If the allocation is in bottom-up direction, we setup direct mapping > * in bottom-up, otherwise we setup direct mapping in top-down. > diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c > new file mode 100644 > index 0000000..3b330a9 > --- /dev/null > +++ b/arch/x86/mm/kaslr.c > @@ -0,0 +1,136 @@ > +#include <linux/kernel.h> > +#include <linux/errno.h> > +#include <linux/types.h> > +#include <linux/mm.h> > +#include <linux/smp.h> > +#include <linux/init.h> > +#include <linux/memory.h> > +#include <linux/random.h> > + > +#include <asm/processor.h> > +#include <asm/pgtable.h> > +#include <asm/pgalloc.h> > +#include <asm/e820.h> > +#include <asm/init.h> > +#include <asm/setup.h> > +#include <asm/kaslr.h> > +#include <asm/kasan.h> > + > +#include "mm_internal.h" > + > +/* Hold the pgd entry used on booting additional CPUs */ > +pgd_t trampoline_pgd_entry; > + > +#define TB_SHIFT 40 > + > +/* > + * Memory base and end randomization is based on different configurations. > + * We want as much space as possible to increase entropy available. > + */ > +static const unsigned long memory_rand_start = __PAGE_OFFSET_BASE; > + > +#if defined(CONFIG_KASAN) > +static const unsigned long memory_rand_end = KASAN_SHADOW_START; > +#elif defined(CONFIG_X86_ESPFIX64) > +static const unsigned long memory_rand_end = ESPFIX_BASE_ADDR; > +#elif defined(CONFIG_EFI) > +static const unsigned long memory_rand_end = EFI_VA_START; > +#else > +static const unsigned long memory_rand_end = __START_KERNEL_map; > +#endif Is it worth adding BUILD_BUG_ON()s to verify these values stay in decreasing size? > + > +/* Default values */ > +unsigned long page_offset_base = __PAGE_OFFSET_BASE; > +EXPORT_SYMBOL(page_offset_base); > +unsigned long vmalloc_base = __VMALLOC_BASE; > +EXPORT_SYMBOL(vmalloc_base); > +unsigned long vmemmap_base = __VMEMMAP_BASE; > +EXPORT_SYMBOL(vmemmap_base); > + > +/* Describe each randomized memory sections in sequential order */ > +static struct kaslr_memory_region { > + unsigned long *base; > + unsigned short size_tb; > +} kaslr_regions[] = { > + { &page_offset_base, 64/* Maximum */ }, > + { &vmalloc_base, VMALLOC_SIZE_TB }, > + { &vmemmap_base, 1 }, > +}; This seems to be __init_data, since it's only used in kernel_randomize_memory()? > + > +/* Size in Terabytes + 1 hole */ > +static inline unsigned long get_padding(struct kaslr_memory_region *region) I think this can be marked __init also? 
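To make those suggestions concrete (the region table tagged __initdata, get_padding() tagged __init in the same way, and a couple of build-time checks on the region ordering), roughly something like this untested sketch, assuming KASAN_SHADOW_START, ESPFIX_BASE_ADDR and EFI_VA_START are all visible as compile-time constants in this file:

/* Only used while setting up the boot-time layout. */
static struct kaslr_memory_region {
	unsigned long *base;
	unsigned short size_tb;
} kaslr_regions[] __initdata = {
	{ &page_offset_base, 64 /* Maximum */ },
	{ &vmalloc_base, VMALLOC_SIZE_TB },
	{ &vmemmap_base, 1 },
};

/* e.g. at the top of kernel_randomize_memory(): */
#if defined(CONFIG_KASAN) && defined(CONFIG_X86_ESPFIX64)
	/* Catch it at build time if these regions ever change order. */
	BUILD_BUG_ON(KASAN_SHADOW_START > ESPFIX_BASE_ADDR);
#endif
#if defined(CONFIG_X86_ESPFIX64) && defined(CONFIG_EFI)
	BUILD_BUG_ON(ESPFIX_BASE_ADDR > EFI_VA_START);
#endif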
> +{ > + return ((unsigned long)region->size_tb + 1) << TB_SHIFT; > +} > + > +/* Initialize base and padding for each memory section randomized with KASLR */ > +void __init kernel_randomize_memory(void) > +{ > + size_t i; > + unsigned long addr = memory_rand_start; > + unsigned long padding, rand, mem_tb; > + struct rnd_state rnd_st; > + unsigned long remain_padding = memory_rand_end - memory_rand_start; > + > + if (!kaslr_enabled()) > + return; > + > + BUG_ON(kaslr_regions[0].base != &page_offset_base); This is statically assigned above, is this BUG_ON useful? > + mem_tb = ((max_pfn << PAGE_SHIFT) >> TB_SHIFT); > + > + if (mem_tb < kaslr_regions[0].size_tb) > + kaslr_regions[0].size_tb = mem_tb; Can you add a comment for this? IIUC, this is just recalculating the max memory size available for padding based on the page shift? Under what situations would this be changing? > + > + for (i = 0; i < ARRAY_SIZE(kaslr_regions); i++) > + remain_padding -= get_padding(&kaslr_regions[i]); > + > + prandom_seed_state(&rnd_st, kaslr_get_random_boot_long()); > + > + /* Position each section randomly with minimum 1 terabyte between */ > + for (i = 0; i < ARRAY_SIZE(kaslr_regions); i++) { > + padding = remain_padding / (ARRAY_SIZE(kaslr_regions) - i); > + prandom_bytes_state(&rnd_st, &rand, sizeof(rand)); > + padding = (rand % (padding + 1)) & PUD_MASK; > + addr += padding; > + *kaslr_regions[i].base = addr; > + addr += get_padding(&kaslr_regions[i]); > + remain_padding -= padding; > + } What happens if we run out of padding here, and doesn't this loop mean earlier regions will have, on average, more padding? Should each instead randomize within a one-time calculation of remaining_padding / ARRAY_SIZE(kaslr_regions) ? Also, to get added to the Kconfig, what is the available entropy here? How far can each of the base addresses get offset? > +} > + > +/* > + * Create PGD aligned trampoline table to allow real mode initialization > + * of additional CPUs. Consume only 1 additonal low memory page. Typo "additional". 
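Coming back to the padding loop above: what I mean by a one-time calculation is roughly this (untested sketch, reusing the locals from kernel_randomize_memory()):

	/*
	 * Split the leftover space evenly up front, so every region draws
	 * its random padding from the same fixed budget instead of the
	 * later regions depending on what the earlier ones consumed. As
	 * long as remain_padding itself is non-negative, the sum of the
	 * draws can then never overrun memory_rand_end.
	 */
	unsigned long budget = remain_padding / ARRAY_SIZE(kaslr_regions);

	prandom_seed_state(&rnd_st, kaslr_get_random_boot_long());

	for (i = 0; i < ARRAY_SIZE(kaslr_regions); i++) {
		prandom_bytes_state(&rnd_st, &rand, sizeof(rand));
		/*
		 * PUD_MASK keeps each base PUD-aligned, so the entropy per
		 * region is roughly log2(budget / PUD_SIZE) bits. That is
		 * the number that would be nice to spell out in the Kconfig
		 * help text.
		 */
		padding = (rand % (budget + 1)) & PUD_MASK;
		addr += padding;
		*kaslr_regions[i].base = addr;
		addr += get_padding(&kaslr_regions[i]);
	}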
*/ > +void __meminit kaslr_trampoline_init(void) > +{ > + unsigned long addr, next; > + pgd_t *pgd; > + pud_t *pud_page, *tr_pud_page; > + int i; > + > + /* If KASLR is disabled, default to the existing page table entry */ > + if (!kaslr_enabled()) { > + trampoline_pgd_entry = init_level4_pgt[pgd_index(PAGE_OFFSET)]; > + return; > + } > + > + tr_pud_page = alloc_low_page(); > + set_pgd(&trampoline_pgd_entry, __pgd(_PAGE_TABLE | __pa(tr_pud_page))); > + > + addr = 0; > + pgd = pgd_offset_k((unsigned long)__va(addr)); > + pud_page = (pud_t *) pgd_page_vaddr(*pgd); > + > + for (i = pud_index(addr); i < PTRS_PER_PUD; i++, addr = next) { > + pud_t *pud, *tr_pud; > + > + tr_pud = tr_pud_page + pud_index(addr); > + pud = pud_page + pud_index((unsigned long)__va(addr)); > + next = (addr & PUD_MASK) + PUD_SIZE; > + > + /* Needed to copy pte or pud alike */ > + BUILD_BUG_ON(sizeof(pud_t) != sizeof(pte_t)); > + *tr_pud = *pud; > + } > +} > diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c > index 0b7a63d..6518314 100644 > --- a/arch/x86/realmode/init.c > +++ b/arch/x86/realmode/init.c > @@ -84,7 +84,11 @@ void __init setup_real_mode(void) > *trampoline_cr4_features = __read_cr4(); > > trampoline_pgd = (u64 *) __va(real_mode_header->trampoline_pgd); > +#ifdef CONFIG_RANDOMIZE_MEMORY > + trampoline_pgd[0] = trampoline_pgd_entry.pgd; > +#else > trampoline_pgd[0] = init_level4_pgt[pgd_index(__PAGE_OFFSET)].pgd; > +#endif To avoid these ifdefs, could trampoline_pgd_entry instead be defined outside of mm/kaslr.c and have .pgd assigned as init_level4_pgt[pgd_index(__PAGE_OFFSET)].pgd via a static inline of kaslr_trampoline_init() instead? > trampoline_pgd[511] = init_level4_pgt[511].pgd; > #endif > } > -- > 2.8.0.rc3.226.g39d4020 > -Kees -- Kees Cook Chrome OS & Brillo Security