On Fri, 2 Oct 2020 at 07:45, Jann Horn <jannh@xxxxxxxxxx> wrote:
>
> On Tue, Sep 29, 2020 at 3:38 PM Marco Elver <elver@xxxxxxxxxx> wrote:
> > Add architecture specific implementation details for KFENCE and enable
> > KFENCE for the x86 architecture. In particular, this implements the
> > required interface in <asm/kfence.h> for setting up the pool and
> > providing helper functions for protecting and unprotecting pages.
> >
> > For x86, we need to ensure that the pool uses 4K pages, which is done
> > using the set_memory_4k() helper function.
> [...]
> > diff --git a/arch/x86/include/asm/kfence.h b/arch/x86/include/asm/kfence.h
> [...]
> > +/* Protect the given page and flush TLBs. */
> > +static inline bool kfence_protect_page(unsigned long addr, bool protect)
> > +{
> > +	unsigned int level;
> > +	pte_t *pte = lookup_address(addr, &level);
> > +
> > +	if (!pte || level != PG_LEVEL_4K)
>
> Do we actually expect this to happen, or is this just a "robustness"
> check? If we don't expect this to happen, there should be a WARN_ON()
> around the condition.

It's not obvious here, but we already have this covered with a WARN:
the core.c code has a KFENCE_WARN_ON, which disables KFENCE on a
warning.

> > +		return false;
> > +
> > +	if (protect)
> > +		set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
> > +	else
> > +		set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));
>
> Hmm... do we have this helper (instead of using the existing helpers
> for modifying memory permissions) to work around the allocation out of
> the data section?

I just played around with using the set_memory.c functions, to remind
myself why this didn't work. I experimented with using
set_memory_{np,p}() functions; set_memory_p() isn't implemented, but
is easily added (which I did for below experiment). However, this
didn't quite work:

WARNING: CPU: 6 PID: 107 at kernel/smp.c:490
smp_call_function_many_cond+0x9c/0x2a0 kernel/smp.c:490
[...]
Call Trace:
 smp_call_function_many kernel/smp.c:577 [inline]
 smp_call_function kernel/smp.c:599 [inline]
 on_each_cpu+0x3e/0x90 kernel/smp.c:698
 __purge_vmap_area_lazy+0x58/0x670 mm/vmalloc.c:1352
 _vm_unmap_aliases.part.0+0x10b/0x140 mm/vmalloc.c:1770
 change_page_attr_set_clr+0xb4/0x1c0 arch/x86/mm/pat/set_memory.c:1732
 change_page_attr_set arch/x86/mm/pat/set_memory.c:1782 [inline]
 set_memory_p+0x21/0x30 arch/x86/mm/pat/set_memory.c:1950
 kfence_protect_page arch/x86/include/asm/kfence.h:55 [inline]
 kfence_protect_page arch/x86/include/asm/kfence.h:43 [inline]
 kfence_unprotect+0x42/0x70 mm/kfence/core.c:139
 no_context+0x115/0x300 arch/x86/mm/fault.c:705
 handle_page_fault arch/x86/mm/fault.c:1431 [inline]
 exc_page_fault+0xa7/0x170 arch/x86/mm/fault.c:1486
 asm_exc_page_fault+0x1e/0x30 arch/x86/include/asm/idtentry.h:538

For one, smp_call_function_many_cond() doesn't want to be called with
interrupts disabled, and we may very well get a KFENCE allocation or
page fault with interrupts disabled / within interrupts.

Therefore, to be safe, we should avoid IPIs. It follows that setting
the page attribute is best-effort, and we can tolerate some
inaccuracy. Lazy fault handling should take care of faults after we
set the page as PRESENT.

Which hopefully also answers your other comment:

> flush_tlb_one_kernel() -> flush_tlb_one_user() ->
> __flush_tlb_one_user() -> native_flush_tlb_one_user() only flushes on
> the local CPU core, not on others. If you want to leave it this way, I
> think this needs a comment explaining why we're not doing a global
> flush (locking context / performance overhead / ... ?).

We'll add a comment to clarify why it's done this way.

> > +	flush_tlb_one_kernel(addr);
> > +	return true;
> > +}
> > +
> > +#endif /* _ASM_X86_KFENCE_H */
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> [...]
> > @@ -701,6 +702,9 @@ no_context(struct pt_regs *regs, unsigned long error_code,
> >  	}
> >  #endif
> >
> > +	if (kfence_handle_page_fault(address))
> > +		return;
> > +
> >  	/*
> >  	 * 32-bit:
> >  	 *
>
> The standard 5 lines of diff context don't really make it obvious
> what's going on here. Here's a diff with more context:
>
>
>  	/*
>  	 * Stack overflow? During boot, we can fault near the initial
>  	 * stack in the direct map, but that's not an overflow -- check
>  	 * that we're in vmalloc space to avoid this.
>  	 */
>  	if (is_vmalloc_addr((void *)address) &&
>  	    (((unsigned long)tsk->stack - 1 - address < PAGE_SIZE) ||
>  	     address - ((unsigned long)tsk->stack + THREAD_SIZE) < PAGE_SIZE)) {
>  		unsigned long stack = __this_cpu_ist_top_va(DF) -
> sizeof(void *);
>  		/*
>  		 * We're likely to be running with very little stack space
>  		 * left. It's plausible that we'd hit this condition but
>  		 * double-fault even before we get this far, in which case
>  		 * we're fine: the double-fault handler will deal with it.
>  		 *
>  		 * We don't want to make it all the way into the oops code
>  		 * and then double-fault, though, because we're likely to
>  		 * break the console driver and lose most of the stack dump.
>  		 */
>  		asm volatile ("movq %[stack], %%rsp\n\t"
>  			      "call handle_stack_overflow\n\t"
>  			      "1: jmp 1b"
>  			      : ASM_CALL_CONSTRAINT
>  			      : "D" ("kernel stack overflow (page fault)"),
>  				"S" (regs), "d" (address),
>  				[stack] "rm" (stack));
>  		unreachable();
>  	}
>  #endif
>
> +	if (kfence_handle_page_fault(address))
> +		return;
> +
>  	/*
>  	 * 32-bit:
>  	 *
>  	 * Valid to do another page fault here, because if this fault
>  	 * had been triggered by is_prefetch fixup_exception would have
>  	 * handled it.
>  	 *
>  	 * 64-bit:
>  	 *
>  	 * Hall of shame of CPU/BIOS bugs.
>  	 */
>  	if (is_prefetch(regs, error_code, address))
>  		return;
>
>  	if (is_errata93(regs, address))
>  		return;
>
>  	/*
>  	 * Buggy firmware could access regions which might page fault, try to
>  	 * recover from such faults.
>  	 */
>  	if (IS_ENABLED(CONFIG_EFI))
>  		efi_recover_from_page_fault(address);
>
>  oops:
>  	/*
>  	 * Oops. The kernel tried to access some bad page. We'll have to
>  	 * terminate things with extreme prejudice:
>  	 */
>  	flags = oops_begin();
>
>
>
> Shouldn't kfence_handle_page_fault() happen after prefetch handling,
> at least? Maybe directly above the "oops" label?

Good question. AFAIK it doesn't matter, as is_kfence_address() should
never apply for any of those that follow, right? In any case, it
shouldn't hurt to move it down.

Thanks,
-- Marco