On Mon, Feb 11, 2019 at 10:05 AM Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
>
> > On Feb 10, 2019, at 9:18 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> >
> >
> >
> > On Feb 10, 2019, at 4:39 PM, Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
> >
> >>> On Jan 28, 2019, at 4:34 PM, Rick Edgecombe <rick.p.edgecombe@xxxxxxxxx> wrote:
> >>>
> >>> From: Nadav Amit <namit@xxxxxxxxxx>
> >>>
> >>> To prevent improper use of the PTEs that are used for text patching, we
> >>> want to use a temporary mm struct. We initialize it by copying the init
> >>> mm.
> >>>
> >>> The address that will be used for patching is taken from the lower area
> >>> that is usually used for the task memory. Doing so prevents the need to
> >>> frequently synchronize the temporary-mm (e.g., when BPF programs are
> >>> installed), since different PGDs are used for the task memory.
> >>>
> >>> Finally, we randomize the address of the PTEs to harden against exploits
> >>> that use these PTEs.
> >>>
> >>> Cc: Kees Cook <keescook@xxxxxxxxxxxx>
> >>> Cc: Dave Hansen <dave.hansen@xxxxxxxxx>
> >>> Acked-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> >>> Reviewed-by: Masami Hiramatsu <mhiramat@xxxxxxxxxx>
> >>> Tested-by: Masami Hiramatsu <mhiramat@xxxxxxxxxx>
> >>> Suggested-by: Andy Lutomirski <luto@xxxxxxxxxx>
> >>> Signed-off-by: Nadav Amit <namit@xxxxxxxxxx>
> >>> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@xxxxxxxxx>
> >>> ---
> >>> arch/x86/include/asm/pgtable.h       |  3 +++
> >>> arch/x86/include/asm/text-patching.h |  2 ++
> >>> arch/x86/kernel/alternative.c        |  3 +++
> >>> arch/x86/mm/init_64.c                | 36 ++++++++++++++++++++++++++++
> >>> init/main.c                          |  3 +++
> >>> 5 files changed, 47 insertions(+)
> >>>
> >>> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> >>> index 40616e805292..e8f630d9a2ed 100644
> >>> --- a/arch/x86/include/asm/pgtable.h
> >>> +++ b/arch/x86/include/asm/pgtable.h
> >>> @@ -1021,6 +1021,9 @@ static inline void __meminit init_trampoline_default(void)
> >>>  	/* Default trampoline pgd value */
> >>>  	trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
> >>>  }
> >>> +
> >>> +void __init poking_init(void);
> >>> +
> >>>  # ifdef CONFIG_RANDOMIZE_MEMORY
> >>>  void __meminit init_trampoline(void);
> >>>  # else
> >>> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> >>> index f8fc8e86cf01..a75eed841eed 100644
> >>> --- a/arch/x86/include/asm/text-patching.h
> >>> +++ b/arch/x86/include/asm/text-patching.h
> >>> @@ -39,5 +39,7 @@ extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
> >>>  extern int poke_int3_handler(struct pt_regs *regs);
> >>>  extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
> >>>  extern int after_bootmem;
> >>> +extern __ro_after_init struct mm_struct *poking_mm;
> >>> +extern __ro_after_init unsigned long poking_addr;
> >>>
> >>>  #endif /* _ASM_X86_TEXT_PATCHING_H */
> >>> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> >>> index 12fddbc8c55b..ae05fbb50171 100644
> >>> --- a/arch/x86/kernel/alternative.c
> >>> +++ b/arch/x86/kernel/alternative.c
> >>> @@ -678,6 +678,9 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
> >>>  	return addr;
> >>>  }
> >>>
> >>> +__ro_after_init struct mm_struct *poking_mm;
> >>> +__ro_after_init unsigned long poking_addr;
> >>> +
> >>>  static void *__text_poke(void *addr, const void *opcode, size_t len)
> >>>  {
> >>>  	unsigned long flags;
> >>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> >>> index bccff68e3267..125c8c48aa24 100644
> >>> --- a/arch/x86/mm/init_64.c
> >>> +++ b/arch/x86/mm/init_64.c
> >>> @@ -53,6 +53,7 @@
> >>>  #include <asm/init.h>
> >>>  #include <asm/uv/uv.h>
> >>>  #include <asm/setup.h>
> >>> +#include <asm/text-patching.h>
> >>>
> >>>  #include "mm_internal.h"
> >>>
> >>> @@ -1383,6 +1384,41 @@ unsigned long memory_block_size_bytes(void)
> >>>  	return memory_block_size_probed;
> >>>  }
> >>>
> >>> +/*
> >>> + * Initialize an mm_struct to be used during poking and a pointer to be used
> >>> + * during patching.
> >>> + */
> >>> +void __init poking_init(void)
> >>> +{
> >>> +	spinlock_t *ptl;
> >>> +	pte_t *ptep;
> >>> +
> >>> +	poking_mm = copy_init_mm();
> >>> +	BUG_ON(!poking_mm);
> >>> +
> >>> +	/*
> >>> +	 * Randomize the poking address, but make sure that the following page
> >>> +	 * will be mapped at the same PMD. We need 2 pages, so find space for 3,
> >>> +	 * and adjust the address if the PMD ends after the first one.
> >>> +	 */
> >>> +	poking_addr = TASK_UNMAPPED_BASE;
> >>> +	if (IS_ENABLED(CONFIG_RANDOMIZE_BASE))
> >>> +		poking_addr += (kaslr_get_random_long("Poking") & PAGE_MASK) %
> >>> +			(TASK_SIZE - TASK_UNMAPPED_BASE - 3 * PAGE_SIZE);
> >>> +
> >>> +	if (((poking_addr + PAGE_SIZE) & ~PMD_MASK) == 0)
> >>> +		poking_addr += PAGE_SIZE;
> >>
> >> Further thinking about it, I think that allocating the virtual address for
> >> poking from user address-range is problematic. The user can set watchpoints
> >> on different addresses, cause some static-keys to be enabled/disabled, and
> >> monitor the signals to derandomize the poking address.
> >
> > Hmm, I hadn’t thought about watchpoints. I’m not sure how much we care
> > about possible derandomization like this, but we certainly don’t want to
> > send signals or otherwise malfunction.
> >
> >> Andy, I think you were pushing this change. Can I go back to use a vmalloc’d
> >> address instead, or do you have a better solution?
> >
> > Hmm. If we use a vmalloc address, we have to make sure it’s not actually
> > allocated. I suppose we could allocate one once at boot and use that. We
> > also have the problem that the usual APIs for handling “user” addresses
> > might assume they’re actually in the user range, although this seems
> > unlikely to be a problem in practice.
> > More seriously, though, the code
> > that manipulates per-mm paging structures assumes that *all* of the
> > structures up to the top level are per-mm, and, if we use anything less
> > than a private pgd, this isn’t the case.
>
> I forgot that I only had this conversation in my mind ;-)
>
> Well, I did write some code that kept some vmalloc’d area private, and it
> did require more synchronization between the pgd’s. It is still possible
> to use another top-level PGD, but … (continued below)
>
> >
> >> I prefer not to
> >> save/restore DR7, of course.
> >
> > I suspect we may want to use the temporary mm concept for EFI, too, so we
> > may want to just suck it up and save/restore DR7. But only if a watchpoint
> > is in use, of course. I have an old patch I could dust off that tracks DR7
> > to make things like this efficient.
>
> … but, if this is the case, then I will just make (un)use_temporary_mm()
> save/restore DR7. I guess you are ok with such a solution. I will
> incorporate it into Rick’s v3.

I'm certainly amenable to other solutions, but this one does seem the least
messy.  I looked at my old patch, and it doesn't do what you want.  I'd
suggest you just add a percpu variable like cpu_dr7 and rig up some
accessors so that it stays up to date.  Then you can skip the dr7 writes if
there are no watchpoints set.

Also, EFI is probably a less interesting example than rare_write.  With
rare_write, especially the dynamically allocated variants that people keep
coming up with, we'll need a swath of address space fully as large as the
vmalloc area.  And getting *that* right while still using the kernel
address range might be more of a mess than we really want to deal with.

--Andy