On Sat, Jul 09, 2022 at 01:14:23AM +0000, Song Liu wrote: > > On Jul 8, 2022, at 3:24 PM, Luis Chamberlain <mcgrof@xxxxxxxxxx> wrote: > > > >> 1) Rename module_alloc_huge as module_alloc_text_huge(); > > > > module_alloc_text_huge() is too long, but I've suggested names before > > which are short and generic, and also suggested that if modules are > > not the only users this needs to go outside of modules and so > > vmalloc_text_huge() or whatever. > > > > To do this right it begs the question why we don't do that for the > > existing module_alloc(), as the users of this code is well outside of > > modules now. Last time a similar generic name was used all the special > > arch stuff was left to be done by the module code still, but still > > non-modules were still using that allocator. From my perspective the > > right thing to do is to deal with all the arch stuff as well in the > > generic handler, and have the module code *and* the other users which > > use module_alloc() to use that new caller as well. > > The key difference between module_alloc() and the new API is that the > API will return RO+X memory, and the user need text-poke like API to > modify this buffer. Archs that do not support text-poke will not be > able to use the new API. Does this sound like a reasonable design? I'm adding kprobe + ftrace folks. I can't see why we need to *require* text_poke for just a module_alloc_huge(). Enhancements on module_alloc() are just enhancements, not requirements. So we have these for instance: ``` from arch/Kconfig config ARCH_OPTIONAL_KERNEL_RWX def_bool n config ARCH_OPTIONAL_KERNEL_RWX_DEFAULT def_bool n config ARCH_HAS_STRICT_KERNEL_RWX def_bool n config STRICT_KERNEL_RWX bool "Make kernel text and rodata read-only" if ARCH_OPTIONAL_KERNEL_RWX depends on ARCH_HAS_STRICT_KERNEL_RWX default !ARCH_OPTIONAL_KERNEL_RWX || ARCH_OPTIONAL_KERNEL_RWX_DEFAULT help If this is set, kernel text and rodata memory will be made read-only, and non-text memory will be made non-executable. This provides protection against certain security exploits (e.g. executing the heap or modifying text) These features are considered standard security practice these days. You should say Y here in almost all cases. config ARCH_HAS_STRICT_MODULE_RWX def_bool n config STRICT_MODULE_RWX bool "Set loadable kernel module data as NX and text as RO" if ARCH_OPTIONAL_KERNEL_RWX depends on ARCH_HAS_STRICT_MODULE_RWX && MODULES default !ARCH_OPTIONAL_KERNEL_RWX || ARCH_OPTIONAL_KERNEL_RWX_DEFAULT help If this is set, module text and rodata memory will be made read-only, and non-text memory will be made non-executable. This provides protection against certain security exploits (e.g. writing to text) ``` With module_alloc() we have the above symbols to tell us when we *can* support strict module rwx. So the way the kernel's modules are allocated and used is: for each module section: module_alloc() module_enable_ro() module_enable_nx() module_enable_x() The above can be read in the code as: load_module() --> layout_and_allocate() complete_formation() Then there is the consideration of set_vm_flush_reset_perms() for freeing. On the module code we use this fore the RO+X stuff (core_layout, init_layout), but now that is a bit obfuscated due to the placement of the call. It would seem the other users use it for the same: * ebpf * kprobes * ftrace I believe you are mentioning requiring text_poke() because the way eBPF code uses the module_alloc() is different. Correct me if I'm wrong, but from what I gather is you use the text_poke_copy() as the data is already RO+X, contrary module_alloc() use cases. You do this since your bpf_prog_pack_alloc() calls set_memory_ro() and set_memory_x() after module_alloc() and before you can use this memory. This is a different type of allocator. And, again please correct me if I'm wrong but now you want to share *one* 2 MiB huge-page for multiple BPF programs to help with the impact of TLB misses. A vmalloc_ro_exec() by definition would imply a text_poke(). Can kprobes, ftrace and modules use it too? It would be nice so to not have to deal with the loose semantics on the user to have to use set_vm_flush_reset_perms() on ro+x later, but I think this can be addressed separately on a case by case basis. But a vmalloc_ro_exec() with a respective free can remove the requirement to do set_vm_flush_reset_perms(). Luis