> On Jan 7, 2020, at 9:01 AM, Edgecombe, Rick P <rick.p.edgecombe@xxxxxxxxx> wrote:
> 
> CC Nadav and Jessica.
> 
> On Mon, 2020-01-06 at 15:36 -1000, Andy Lutomirski wrote:
>>> On Jan 6, 2020, at 12:25 PM, Edgecombe, Rick P <rick.p.edgecombe@xxxxxxxxx>
>>> wrote:
>>> 
>>> On Sat, 2020-01-04 at 09:49 +0900, Andy Lutomirski wrote:
>>>>>>> On Jan 4, 2020, at 8:47 AM, KP Singh <kpsingh@xxxxxxxxxxxx> wrote:
>>>>>> 
>>>>>> From: KP Singh <kpsingh@xxxxxxxxxx>
>>>>>> 
>>>>>> The image for the BPF trampolines is allocated with
>>>>>> bpf_jit_alloc_exe_page which marks this allocated page executable. This
>>>>>> means that the allocated memory is W and X at the same time making it
>>>>>> susceptible to WX based attacks.
>>>>>> 
>>>>>> Since the allocated memory is shared between two trampolines (the
>>>>>> current and the next), 2 pages must be allocated to adhere to W^X and
>>>>>> the following sequence is obeyed where trampolines are modified:
>>>>> 
>>>>> Can we please do better rather than piling garbage on top of garbage?
>>>>> 
>>>>>> 
>>>>>> - Mark memory as non executable (set_memory_nx). While module_alloc for
>>>>>>   x86 allocates the memory as PAGE_KERNEL and not PAGE_KERNEL_EXEC, not
>>>>>>   all implementations of module_alloc do so
>>>>> 
>>>>> How about fixing this instead?
>>>>> 
>>>>>> - Mark the memory as read/write (set_memory_rw)
>>>>> 
>>>>> Probably harmless, but see above about fixing it.
>>>>> 
>>>>>> - Modify the trampoline
>>>>> 
>>>>> Seems reasonable. It’s worth noting that this whole approach is suboptimal:
>>>>> the “module” allocator should really be returning a list of pages to be
>>>>> written (not at the final address!) with the actual executable mapping to
>>>>> be materialized later, but that’s a bigger project that you’re welcome to
>>>>> ignore for now. (Concretely, it should produce a vmap address with backing
>>>>> pages but with the vmap alias either entirely unmapped or read-only. A
>>>>> subsequent helper would, all at once, make the direct map pages RO or
>>>>> not-present and make the vmap alias RX.)
>>>>> 
>>>>>> - Mark the memory as read-only (set_memory_ro)
>>>>>> - Mark the memory as executable (set_memory_x)
>>>>> 
>>>>> No, thanks. There’s very little excuse for doing two IPI flushes when one
>>>>> would suffice.
>>>>> 
>>>>> As far as I know, all architectures can do this with a single flush without
>>>>> races. x86 certainly can. The module freeing code gets this sequence right.
>>>>> Please reuse its mechanism or, if needed, export the relevant interfaces.
>>> 
>>> So if I understand this right, some trampolines have been added that are
>>> currently set as RWX at modification time AND left that way during runtime?
>>> The discussion on the order of set_memory_() calls in the commit message made
>>> me think that this was just a modification time thing at first.
>> 
>> I’m not sure what the status quo is.
>> 
>> We really ought to have a genuinely good API for allocation and initialization
>> of text. We can do so much better than set_memory_blahblah.
>> 
>> FWIW, I have some ideas about making kernel flushes cheaper. It’s currently
>> blocked on finding some time and on tglx’s irqtrace work.
>> 
> 
> Makes sense to me. I guess there are 6 types of text allocations now:
> - These two BPF trampolines
> - BPF JITs
> - Modules
> - Kprobes
> - Ftrace
> 
> All doing (or should be doing) pretty much the same thing. I believe Jessica had
> said at one point that she didn't like all the other features using
> module_alloc() as it was supposed to be just for real modules. Where would the
> API live? New header?

This shouldn’t matter that much. Here are two strawman proposals. All of this
is very rough -- the actual data structures and signatures are likely
problematic for multiple reasons.

--- First proposal ---

struct text_allocation {
        void *final_addr;
        struct page *pages;
        int npages;
};

int text_alloc(struct text_allocation *out, size_t size);

/* now final_addr is not accessible and pages is writable. */

int text_freeze(struct text_allocation *alloc);

/* now pages are not accessible and final_addr is RO. Alternatively,
   pages are RO and final_addr is unmapped. */

int text_finish(struct text_allocation *alloc);

/* now final_addr is RX. All done. */

This gets it with just one flush and gives a chance to double-check in case of
race attacks from other CPUs. Double-checking is annoying, though.
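Very roughly, a caller of this first flow might look like the sketch below.
Only text_alloc()/text_freeze()/text_finish() come from the strawman above;
TRAMP_SIZE, emit_trampoline() and text_free() are placeholders I'm making up
for illustration (the strawman doesn't define a free path yet), and I'm
assuming the image fits in the first page and the "final_addr is RO after
freeze" variant:

static int install_trampoline_v1(void **image_out)
{
        struct text_allocation txt;
        u8 buf[TRAMP_SIZE];     /* staging buffer for the generated insns */
        int err;

        err = text_alloc(&txt, sizeof(buf));
        if (err)
                return err;

        /* Generate the instructions, relocated for the final address. */
        emit_trampoline(buf, txt.final_addr);

        /* final_addr is unmapped; the backing pages are still writable. */
        memcpy(page_address(txt.pages), buf, sizeof(buf));

        /* Pages become inaccessible, final_addr becomes RO. */
        err = text_freeze(&txt);
        if (err)
                goto out_free;

        /*
         * The double-check: if another CPU raced with us through a writable
         * alias, the now-RO contents won't match what we emitted.
         */
        if (memcmp(txt.final_addr, buf, sizeof(buf))) {
                err = -EAGAIN;
                goto out_free;
        }

        /* Make final_addr RX. All done. */
        err = text_finish(&txt);
        if (err)
                goto out_free;

        *image_out = txt.final_addr;
        return 0;

out_free:
        text_free(&txt);
        return err;
}

With the second proposal the memcmp() double-check goes away entirely; see the
sketch after the pros and cons below.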
--- Second proposal ---

struct text_allocation {
        void *final_addr;
        /* lots of opaque stuff including an mm_struct */
        /* optional: list of struct page, but this isn't obviously useful */
};

int text_alloc(struct text_allocation *out, size_t size);

/* Memory is allocated. There is no way to access it at all right now.
   The memory is RO or not present in the direct map. */

void __user *text_activate_mapping(struct text_allocation *out);

/* Now the text is RW at *user* address given by return value. Preemption
   is off if required by use_temporary_mm(). Real user memory cannot be
   accessed. */

void text_deactivate_mapping(struct text_allocation *alloc);

/* Now the memory is inaccessible again. */

void text_finalize(struct text_allocation *alloc);

/* Now it's RX or XO at the final address. */

Pros of second approach:

 - Inherently immune to cross-CPU attack. No double-check.

 - If we ever implement a cache of non-direct-mapped, unaliased pages, then
   it works with no flushes at all. We could even relax it a bit to allow
   non-direct-mapped pages that may have RX / XO aliases but no W aliases.

 - Can easily access without worrying about page boundaries.

Cons:

 - The use of a temporary mm is annoying -- you can't copy from user
   memory, for example.
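And the same hypothetical caller on top of this second flow. Again, only the
text_*() calls come from the strawman above; TRAMP_SIZE, emit_trampoline() and
text_free() are the same made-up placeholders, and copy_to_user() is just one
plausible way to fill the temporary-mm mapping:

static int install_trampoline_v2(void **image_out)
{
        struct text_allocation txt;
        u8 buf[TRAMP_SIZE];     /* staged in a kernel buffer, since the con
                                   above means we can't fill the mapping
                                   straight from user memory */
        void __user *waddr;
        int err;

        err = text_alloc(&txt, sizeof(buf));
        if (err)
                return err;

        /* Generate the instructions, relocated for the final address. */
        emit_trampoline(buf, txt.final_addr);

        /*
         * The only writable alias lives at a user address in the temporary
         * mm, so no other CPU can see it and no double-check is needed.
         */
        waddr = text_activate_mapping(&txt);
        if (copy_to_user(waddr, buf, sizeof(buf)))
                err = -EFAULT;
        text_deactivate_mapping(&txt);

        if (!err)
                err = text_finalize(&txt);      /* RX or XO at final_addr */
        if (err) {
                text_free(&txt);
                return err;
        }

        *image_out = txt.final_addr;
        return 0;
}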