Song!

On Wed, Nov 30 2022 at 08:18, Song Liu wrote:
> On Tue, Nov 29, 2022 at 3:56 PM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>> You are not making anything easier. You are violating the basic
>> engineering principle of "Fix the root cause, not the symptom".
>>
>
> I am not sure what is the root cause and the symptom here.

The symptom is iTLB pressure. The root cause is the way module memory is
allocated, which in turn causes the fragmentation into 4k PTEs. That's
the same problem for anything which uses module_alloc() to get space for
text, e.g. kprobes, tracing....

A module consists of:

  - text sections
  - data sections

Except for PPC32, which has the module data in vmalloc space, all others
allocate text and data sections in one lump. This en-bloc allocation is
one reason for the 4k splits:

  - text is RX
  - data is RW or RO

Truly vmalloc'ed module data is not an option for 64-bit architectures
which use PC-relative addressing, as vmalloc does not guarantee that the
data ends up within the limited displacement range (s32 on x86_64).

This made me look at your allocator again:

> +#if defined(CONFIG_MODULES) && defined(MODULES_VADDR)
> +#define EXEC_MEM_START MODULES_VADDR
> +#define EXEC_MEM_END MODULES_END
> +#else
> +#define EXEC_MEM_START VMALLOC_START
> +#define EXEC_MEM_END VMALLOC_END
> +#endif

The #else part is completely broken on x86_64 and any other architecture
which has restricted PC-relative displacement. Even if modules are
disabled in Kconfig, the only safe place to allocate executable kernel
text from (on these architectures) is the module address space. The ISA
restrictions do not magically go away when modules are disabled.

In an early version of the SKX retbleed mitigation work I had

  https://lore.kernel.org/all/20220716230953.442937066@xxxxxxxxxxxxx

exactly to handle this correctly for the !MODULE case. It went nowhere
as we did not need the trampolines in the final version.
This is why Peter suggested to 'split' the module address range into a
top-down and a bottom-up part:

  https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

That obviously separates text and data, but keeps everything within the
defined working range. It immediately solves the text problem for _all_
module_alloc() users and still leaves the data split into 4k pages due
to the RO/RW sections.

But after staring at it for a while I think this top-down and bottom-up
dance is too much effort for not much gain. The module address space is
sized generously, so the straightforward solution is to split that
space into two blocks and use them to allocate text and data
separately. The rest of Peter's suggestions on how to migrate there
still apply.

The init sections of a module are obviously separate as they are freed
after the module is initialized, but they are not really special
either. Today they leave holes in the address range. With the new
scheme these holes will be in the memory-backed large mapping, but I
don't see a real issue with that, especially as those holes, at least
in text, can be reused for small allocations (kprobes, trace, bpf).

As a logical next step we make that three blocks and allocate text,
data and rodata separately, which will preserve the large mappings for
text and data. rodata still needs to be split because we need space to
accommodate ro_after_init data.

Alternatively, instead of splitting the module address space, the
allocation mechanism can keep track of the types (text, data, rodata)
and manage large mapping blocks per type. There are pros and cons for
both approaches, so that needs some thought.

But at the end we want an allocation mechanism which:

  - preserves large mappings
  - handles a distinct address range
  - is mapping type aware

That solves _all_ the issues of modules, kprobes, tracing, bpf in one
go. See?

Thanks,

        tglx