> On Jul 13, 2022, at 3:20 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Wed, Jul 13, 2022 at 12:18:44AM -0700, Song Liu wrote:
>> Dynamically allocated kernel texts, such as module texts, bpf programs,
>> and ftrace trampolines, are used in more and more scenarios. Currently,
>> these users allocate memory with module_alloc, fill the memory with text,
>> and then use set_memory_[ro|x] to protect the memory.
>>
>> This approach has two issues:
>> 1) each of these users occupies one or more RO+X pages, and thus one or
>>    more entries in the page table and the iTLB;
>> 2) frequent allocate/free of RO+X pages causes fragmentation of the
>>    kernel direct map [1].
>>
>> BPF prog pack [2] addresses this from the BPF side. Now, make the same
>> logic available to other users of dynamic kernel text.
>>
>> The new API is:
>>
>>   void *vmalloc_exec(size_t size);
>>   void vfree_exec(void *addr, size_t size);
>>
>> vmalloc_exec has different handling for small and big allocations
>> (> PMD_SIZE * num_possible_nodes). Bigger allocations get a dedicated
>> vmalloc allocation, while small allocations share a vmalloc_exec_pack
>> with other allocations.
>>
>> Once allocated, the vmalloc_exec_pack is filled with invalid instructions
>
> *sigh*, again, INT3 is a *VALID* instruction.

I am fully aware "invalid" or "illegal" is not accurate, but I am not sure
what to use. Shall we call them "safe" instructions?

>> and protected with RO+X. Some text_poke feature is required to make
>> changes to the vmalloc_exec_pack. Therefore, vmalloc_exec requires changes
>> from the arch (to provide text_poke family APIs), and the user (to use
>> text poke APIs to make any changes to the memory).
>
> I hate the naming; this isn't just vmalloc, this is a whole different
> allocator built on top of things.
>
> I'm also not convinced this is the right way to go about doing this;
> much of the design here is because of how the module range is mixing
> text and data and working around that.

Hmm... I am not sure mixed data/text is the only problem here.

> So how about instead we separate them? Then much of the problem goes
> away, you don't need to track these 2M chunks at all.

If we manage the memory at a granularity smaller than 2MiB, whether 4kB or
smaller, we still need some way to track which parts are in use, no? I mean
the bitmap.

> Start by adding VM_TOPDOWN_VMAP, which instead of returning the lowest
> (leftmost) vmap_area that fits, picks the highest (rightmost).
>
> Then add module_alloc_data() that uses VM_TOPDOWN_VMAP and make
> ARCH_WANTS_MODULE_DATA_IN_VMALLOC use that instead of vmalloc (with a
> weak function doing the vmalloc).
>
> This gets you: bottom of module range is RO+X only, top is shattered
> between different !X types.
>
> Then track the boundary between X and !X and ensure module_alloc_data()
> and module_alloc() never cross over and stay strictly separated.
>
> Then change all module_alloc() users to expect RO+X memory, instead of
> RW.
>
> Then make sure any extension of the X range is 2M aligned.
>
> And presto, *everybody* always uses 2M TLB for text, modules, bpf,
> ftrace, the lot, and nobody is tracking chunks.
>
> Maybe migration can be eased by instead providing module_alloc_text()
> and ARCH_WANTS_MODULE_ALLOC_TEXT.
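If I am reading this right, a module_alloc_text() user would then look
roughly like the sketch below. This is only my guess at the interface
(module_alloc_text() is just the name you suggested above; the signature
and NULL-on-failure are made up), the point being that callers never see
the memory as RW and do all writes through the text_poke() family
(text_poke_copy() on x86):

#include <linux/moduleloader.h>
#include <asm/text-patching.h>

/*
 * Sketch only: module_alloc_text() does not exist today; assume it
 * returns memory that is already RO+X and backed by 2M pages.
 */
static void *install_text(const void *insns, size_t len)
{
        void *dst;

        dst = module_alloc_text(len);
        if (!dst)
                return NULL;

        /* No plain memcpy(); the mapping is never writable from here. */
        text_poke_copy(dst, insns, len);

        return dst;
}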
If we have the text/data separation, can we just put text after _etext?
Right now, we allocate huge pages for the range _stext to
round_down(_etext, 2MB), and 4kB pages for round_down(_etext, 2MB) to
round_up(_etext, 4kB). To make this more efficient, we can allocate huge
pages for _stext to round_up(_etext, 2MB), and use _etext to
round_up(_etext, 2MB) as the first pool of memory for module_alloc_text().
Once we have used all the memory there, we allocate more huge pages after
round_up(_etext, 2MB).

I am not sure how to make all of this work yet, but I guess it is similar
to the idea you are describing here? However, we will still need some
bitmap to track the usage of these memory pools, right? A rough (and
completely untested) sketch of the bookkeeping I have in mind is appended
below.

Thanks,
Song
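Here is the sketch mentioned above. It is completely untested; every name,
the chunk size, and the locking are made up, and it is only meant to show
the per-pool bitmap bookkeeping (pool setup and growing with new 2MB pages
are omitted):

#include <linux/bitmap.h>
#include <linux/kernel.h>
#include <linux/spinlock.h>

/* Arbitrary allocation granularity, only for illustration. */
#define TEXT_POOL_CHUNK 64

/*
 * One pool would cover [_etext, round_up(_etext, 2MB)) at first; later
 * pools would cover whole 2MB pages allocated above round_up(_etext, 2MB).
 */
struct text_pool {
        void            *base;
        size_t          size;
        unsigned long   *used;          /* one bit per TEXT_POOL_CHUNK */
        spinlock_t      lock;
};

static void *text_pool_alloc(struct text_pool *pool, size_t len)
{
        unsigned long chunks = pool->size / TEXT_POOL_CHUNK;
        unsigned long nbits = DIV_ROUND_UP(len, TEXT_POOL_CHUNK);
        unsigned long start;
        void *addr = NULL;

        spin_lock(&pool->lock);
        start = bitmap_find_next_zero_area(pool->used, chunks, 0, nbits, 0);
        if (start < chunks) {
                bitmap_set(pool->used, start, nbits);
                addr = pool->base + start * TEXT_POOL_CHUNK;
        }
        spin_unlock(&pool->lock);

        /* NULL means this pool is full and we need a new 2MB page. */
        return addr;
}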