This is a proposal to address some inefficiencies in how memory
permissions are handled on vmalloc mappings. The way the interfaces are
defined across vmalloc and cpa makes it hard to fully address problems
underneath the existing interfaces. So this creates a new interface in
vmalloc that encapsulates what vmalloc memory permission usages need, but
with more details handled on the back end. This allows for optimizations
and shared caches of resources. The genesis for this was this
conversation[0] and many of the ideas were suggested by Andy Lutomirski.

In its current state it takes module loads down to usually one kernel
range shootdown, and BPF JIT loads down to usually zero on x86. It also
minimizes the direct map 4k breakage when possible.

For the future, x86 also has new kernel memory permission types that would
benefit from efficiently handling the direct map permissions/unmapping,
for example [1]. However, this patchset is just targeting improving
performance inefficiencies with existing usages in modules and normal eBPF
JITs.

The code is early and very lightly tested. I was hoping to get some
feedback on the approach.

The Problem
===========

For a little background, the way executable code is loaded (modules, BPF
JITs, etc) on x86 goes something like this:

ptr = vmalloc_node_range(..., PAGE_KERNEL)
        alloc_page() - Get random pages from page allocator
        map_kernel_range() - Map vmalloc range allocation as RW to those pages
set_memory_ro(ptr)
        set vmalloc alias to RO, break direct map aliases to 4k/RO, all-cpu
        shootdown
        vm_unmap_aliases() flush any lazy RW aliases to now RO pages, all-cpu
        shootdown
set_memory_x(ptr)
        set vmalloc alias to not-NX, all-cpu shootdown
        vm_unmap_aliases(), possible all-cpu shootdown

So during this load operation, 4 shootdowns may take place and the direct
map will be broken to 4k pages across whichever random pages got used in
the executable region. When a split is required it can be even more.
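
As a rough illustration of the tally above, here is a toy userspace C
model that only counts the all-cpu shootdowns in the sequence as
described. It is a sketch, not kernel code: struct load_cost,
model_legacy_load() and the lazy_alias_pending flag are made-up names for
illustration, and the per-step counts are read straight from the
description rather than measured.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bookkeeping type; nothing here exists in the kernel. */
struct load_cost {
	unsigned int all_cpu_shootdowns;
};

/*
 * Count the all-cpu shootdowns for one executable load, following the
 * sequence described above. lazy_alias_pending is an assumption standing
 * in for "the second vm_unmap_aliases() call found something to flush".
 */
struct load_cost model_legacy_load(bool lazy_alias_pending)
{
	struct load_cost cost = { 0 };

	/* vmalloc_node_range(): fresh RW mapping, no shootdown counted */

	/* set_memory_ro(): vmalloc alias to RO, direct map alias to 4k/RO */
	cost.all_cpu_shootdowns++;
	/* vm_unmap_aliases(): flush lazy RW aliases of the now RO pages */
	cost.all_cpu_shootdowns++;

	/* set_memory_x(): vmalloc alias to not-NX */
	cost.all_cpu_shootdowns++;
	/* vm_unmap_aliases() again: only a possible shootdown */
	if (lazy_alias_pending)
		cost.all_cpu_shootdowns++;

	/* The description gives 4 as the count for this sequence. */
	assert(cost.all_cpu_shootdowns <= 4);
	return cost;
}
```

In this model, model_legacy_load(false) counts 3 shootdowns and
model_legacy_load(true) counts 4. Direct map splits, which can add more,
are deliberately not modelled.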

Besides the direct map, the other reason for this is having to change the
permission of the vmalloc mapping several times in order to load it while
it's writable, and then transition it to its final permission.

Ideally we would unmap pages from the direct map in bulk to share a
shootdown. For changing the vmalloc mapping's permission, we should
instead map it at its final permission from the start and use a temporary
per-cpu mapping such as text_poke() to load the data such that it only
requires a local TLB flush.

For large page breakage on the direct map, if multiple JITs happen to get
pages from the same 2MB physical region this can limit the damage to a
smaller region. However, currently this depends on lucky physical distance
of the pages picked inside vmalloc. Today it seems more likely to happen
if allocations are made close together in time. Ideally we would make an
effort to group pages used for permissioned vmallocs together physically
so the direct map breakage would be minimized.

But trying to improve this doesn't fit into the existing interfaces very
well.
 - vmalloc_node_range() doesn't know what its final permission will be.
 - There isn't any cross-arch way to describe to vmalloc what the
   permissions will be, since permissions are encoded into the name of the
   set_memory_foo() functions.
 - text_poke() only exists on x86, and other HW ways of temporarily
   writing to RO mappings don't necessarily have standardized semantics.

Proposed solution
=================

For text or RO allocations, to oversimplify, what usages want to do is
just say:
 1. Give me a kva for this particular permission and size
 2. Load this data into it
 3. Make it "live" (no writable mapping, no direct map mapping, whatever
    permissions are set on it and ready to go)

So this implements a new interface to do just that. I had in mind this
interface should try to support the following optimizations on x86 even if
they weren't implemented right away.

1. Draw from 2MB physical pages that can be unmapped from the direct map
in contiguous chunks

In memory pressure situations a shrinker callback can free unused pages
from the cache. These can get re-mapped on the fly without any flush since
the direct map would be transitioning NP->RW. Since we can re-map the
direct map cheaply, it's better to unmap more than we need. This part is
close to secretmem[2] in implementation, and should possibly share
infrastructure or caches of unmapped pages.

2. Load text/special data via per-cpu mappings

The mapping can be mapped in its "final" permission, and loaded via
text_poke(). This will reduce shootdowns during loads to zero in most
cases, leaving just local flushes. The new interface provides a writable
buffer for usages to stage their changes and then trigger the copy into
the RO mapping via text_poke().

3. Caching of virtual mappings as well as pages

Normally executable mappings need to be zapped and flushed before the
pages return to the page allocator to prevent random other memory that
uses the page later from having an executable alias. But we could cache
these live mappings and delay the flush until the page is needed for an
allocation of a larger size or different permission. The "free" operations
could just zero it with a per-cpu mapping to prevent unwanted text from
remaining mapped.

4. 2MB module space mappings

It would be nice if the virtual mappings of the same permission types
could be placed next to each other so that they could share 2MB mappings.
This way we could have modules or PKS memory have 2MB pages. Of course
allocating from a 2MB block could cause internal fragmentation and wasted
memory, however it might be possible to break the virtual mapping later
and allow the wasted memory to be unmapped and freed in the formerly 2MB
page. Often a bunch of modules are loaded at boot.
If we placed the long lived "core sections" of these modules sequentially
into the 2MB blocks, there is probably a good chance we could get some
decent utilization out of one.

This RFC just has 1 and 2 actually implemented on x86.

Module loader changes
=====================

Of the text allocation usages, kernel modules are the most complex because
a single vmalloc allocation has many memory permissions across it (RO+X
for the text, RO, RO after init, and RW). In addition to this preventing
having module text mapped in 2MB pages since the text is all scattered
around in different allocations, it would require more complexity for the
new interface.

However, at least for x86, it doesn't seem like there is any requirement
for a module allocation to be virtually contiguous. Instead we could have
the module loader treat each of its 4 permission regions as separate
allocations. Then the new interface could be simpler and it could have the
option of putting similar permission allocations next to each other to
achieve 2MB pages or more opportunities to reuse existing mappings.

The challenge in changing this in the module loader is that most of it is
cross-arch code and there could be relocation rules required by various
arches that depend on the existing virtual address distances. To try to
transition to this interface without disturbing anything, the default
module.c behavior is to lay out the modules as they were before in both
location and permissions, but wrapped separately as multiple instances of
the new type of allocation. This way it could have no functional change
for other architectures at first, but allow any to implement similar
optimizations in the arch module.c breakouts. So this RFC also looks at
handling things as separate allocations, and actually allocates them
separately for x86.

Of course, there are several areas outside of modules that are involved in
modifying the module text and data such as alternatives, orc unwinding,
etc.
These components are changed to be aware they may need to operate on the
writable staging area buffer.

[0] https://lore.kernel.org/lkml/CALCETrV_tGk=B3Hw0h9viW45wMqB_W+rwWzx6LnC3-vSATOUOA@xxxxxxxxxxxxxx/
[1] https://lore.kernel.org/lkml/20201009201410.3209180-1-ira.weiny@xxxxxxxxx/
[2] https://lore.kernel.org/lkml/20200924132904.1391-1-rppt@xxxxxxxxxx/

This RFC has been acked by Dave Hansen.

Rick Edgecombe (10):
  vmalloc: Add basic perm alloc implementation
  bpf: Use perm_alloc() for BPF JIT filters
  module: Use perm_alloc() for modules
  module: Support separate writable allocation
  x86/modules: Use real perm_allocations
  x86/alternatives: Handle perm_allocs for modules
  x86/unwind: Unwind orc at module writable address
  jump_label: Handle module writable address
  ftrace: Use module writable address
  vmalloc: Add perm_alloc x86 implementation

 arch/Kconfig                      |   3 +
 arch/arm/net/bpf_jit_32.c         |   3 +-
 arch/arm64/net/bpf_jit_comp.c     |   5 +-
 arch/mips/net/bpf_jit.c           |   2 +-
 arch/mips/net/ebpf_jit.c          |   3 +-
 arch/powerpc/net/bpf_jit_comp.c   |   2 +-
 arch/powerpc/net/bpf_jit_comp64.c |  10 +-
 arch/s390/net/bpf_jit_comp.c      |   5 +-
 arch/sparc/net/bpf_jit_comp_32.c  |   2 +-
 arch/sparc/net/bpf_jit_comp_64.c  |   5 +-
 arch/x86/Kconfig                  |   1 +
 arch/x86/include/asm/set_memory.h |   2 +
 arch/x86/kernel/alternative.c     |  25 +-
 arch/x86/kernel/jump_label.c      |  18 +-
 arch/x86/kernel/module.c          |  84 ++++-
 arch/x86/kernel/unwind_orc.c      |   8 +-
 arch/x86/mm/Makefile              |   1 +
 arch/x86/mm/pat/set_memory.c      |  13 +
 arch/x86/mm/vmalloc.c             | 438 +++++++++++++++++++++++
 arch/x86/net/bpf_jit_comp.c       |  15 +-
 arch/x86/net/bpf_jit_comp32.c     |   3 +-
 include/linux/filter.h            |  30 +-
 include/linux/module.h            |  66 +++-
 include/linux/vmalloc.h           |  82 +++++
 kernel/bpf/core.c                 |  48 ++-
 kernel/jump_label.c               |   2 +-
 kernel/module.c                   | 561 ++++++++++++++++++------------
 kernel/trace/ftrace.c             |   2 +-
 mm/nommu.c                        |  66 ++++
 mm/vmalloc.c                      | 135 +++++++
 30 files changed, 1308 insertions(+), 332 deletions(-)
 create mode 100644 arch/x86/mm/vmalloc.c

-- 
2.20.1