Le 07/02/2023 à 01:28, Song Liu a écrit : > module_layout manages different types of memory (text, data, rodata, etc.) > in one allocation, which is problematic for some reasons: > > 1. It is hard to enable CONFIG_STRICT_MODULE_RWX. > 2. It is hard to use huge pages in modules (and not break strict rwx). > 3. Many archs uses module_layout for arch-specific data, but it is not > obvious how these data are used (are they RO, RX, or RW?) > > Improve the scenario by replacing 2 (or 3) module_layout per module with > up to 7 module_memory per module: > > MOD_TEXT, > MOD_DATA, > MOD_RODATA, > MOD_RO_AFTER_INIT, > MOD_INIT_TEXT, > MOD_INIT_DATA, > MOD_INIT_RODATA, > > and allocating them separately. This adds slightly more entries to > mod_tree (from up to 3 entries per module, to up to 7 entries per > module). However, this at most adds a small constant overhead to > __module_address(), which is expected to be fast. > > Various archs use module_layout for different data. These data are put > into different module_memory based on their location in module_layout. > IOW, data that used to go with text is allocated with MOD_MEM_TYPE_TEXT; > data that used to go with data is allocated with MOD_MEM_TYPE_DATA, etc. > > module_memory simplifies quite some of the module code. For example, > ARCH_WANTS_MODULES_DATA_IN_VMALLOC is a lot cleaner, as it just uses a > different allocator for the data. kernel/module/strict_rwx.c is also > much cleaner with module_memory. > > Signed-off-by: Song Liu <song@xxxxxxxxxx> > Cc: Luis Chamberlain <mcgrof@xxxxxxxxxx> > Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx> > Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx> > Cc: Guenter Roeck <linux@xxxxxxxxxxxx> > Cc: Christophe Leroy <christophe.leroy@xxxxxxxxxx> > > --- > This is the preparation work for the type aware module_alloc() discussed > in [1]. While this work is not covered much in the discussion, it is a > critical step of the effort. > > As this part grows pretty big (~1000 lines, + and -), I would like get > some feedback on it, so that I know it is on the right track. > > Please share your comments. Thanks! > > Test coverage: Tested on x86_64. > v9 passed build tests by kernel test bot in [2]. > > [1] https://lore.kernel.org/linux-mm/20221107223921.3451913-1-song@xxxxxxxxxx/T/#u > [2] https://lore.kernel.org/linux-raid/63df0daa.eKYOEelTitBUzF+e%25lkp@xxxxxxxxx/T/#u > > Changes v1 => v2: > 1. Add data_addr_[min|max] for CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC > case. > > Changes v2 => v3: > 1. Fix and remove the last use of INIT_OFFSET_MASK. > 2. Add more information in the commit log. (Luis Chamberlain) > 3. Rebase and fix issues in x86/calltrunks. > 4. Minor cleanup. > > Changes v3 => v4: > 1. Shorten enum/variable names, so that the code are easier to read. > (Christophe Leroy) > 2. Remove an used variable. (Guenter Roeck, Christophe Leroy) > > Changes v4 => v5: > 1. Simplify some code some code. (Peter Zijlstra, Christophe Leroy) > 2. Remove module_check_misalignment(), which is not useful any more. > > Changes v5 => v6: > 1. Improve mod_mem_type_is_* and for_*mod_mem_type marcos. > (Peter Zijlstra). > > Changes v6 => v7: > 1. Use --ignore-space-at-eol instead of -b. > > Changes v7 => v8: > 1. Address feedback from Peter. > > Changes v8 => v9: > 1. Fix a compile error for powerpc. (Christophe Leroy) > > Changes v9 => v10: > 1. Clean-up, style improvements, more comments. (Thomas Gleixner) > --- > arch/alpha/kernel/module.c | 2 +- > arch/arc/kernel/unwind.c | 9 +- > arch/arm/kernel/module-plts.c | 9 +- > arch/arm64/kernel/module-plts.c | 13 +- > arch/ia64/kernel/module.c | 24 +- > arch/mips/kernel/vpe.c | 11 +- > arch/parisc/kernel/module.c | 51 ++-- > arch/powerpc/kernel/module_32.c | 7 +- > arch/s390/kernel/module.c | 26 +- > arch/x86/kernel/callthunks.c | 4 +- > arch/x86/kernel/module.c | 4 +- > include/linux/module.h | 89 +++++-- > kernel/module/internal.h | 40 ++-- > kernel/module/kallsyms.c | 58 +++-- > kernel/module/kdb.c | 17 +- > kernel/module/main.c | 404 +++++++++++++++++--------------- > kernel/module/procfs.c | 16 +- > kernel/module/strict_rwx.c | 99 ++------ > kernel/module/tree_lookup.c | 39 ++- > 19 files changed, 457 insertions(+), 465 deletions(-) > > diff --git a/arch/alpha/kernel/module.c b/arch/alpha/kernel/module.c > index 5b60c248de9e..9109213abc09 100644 > --- a/arch/alpha/kernel/module.c > +++ b/arch/alpha/kernel/module.c > @@ -148,7 +148,7 @@ apply_relocate_add(Elf64_Shdr *sechdrs, const char *strtab, > > /* The small sections were sorted to the end of the segment. > The following should definitely cover them. */ > - gp = (u64)me->core_layout.base + me->core_layout.size - 0x8000; > + gp = (u64)me->mem[MOD_DATA].base + me->mem[MOD_DATA].size - 0x8000; > got = sechdrs[me->arch.gotsecindex].sh_addr; > > for (i = 0; i < n; i++) { > diff --git a/arch/arc/kernel/unwind.c b/arch/arc/kernel/unwind.c > index 200270a94558..933451f4494f 100644 > --- a/arch/arc/kernel/unwind.c > +++ b/arch/arc/kernel/unwind.c > @@ -369,6 +369,8 @@ void *unwind_add_table(struct module *module, const void *table_start, > unsigned long table_size) > { > struct unwind_table *table; > + struct module_memory *mod_mem_core_text; > + struct module_memory *mod_mem_init_text; This function is small (35 lines) so no need to have so big names for local functions, see https://docs.kernel.org/process/coding-style.html#naming struct module_memory *core_text; struct module_memory *init_text; > > if (table_size <= 0) > return NULL; > @@ -377,9 +379,12 @@ void *unwind_add_table(struct module *module, const void *table_start, > if (!table) > return NULL; > > + mod_mem_core_text = &module->mem[MOD_TEXT]; > + mod_mem_init_text = &module->mem[MOD_INIT_TEXT]; > + > init_unwind_table(table, module->name, > - module->core_layout.base, module->core_layout.size, > - module->init_layout.base, module->init_layout.size, > + mod_mem_core_text->base, mod_mem_core_text->size, > + mod_mem_init_text->base, mod_mem_init_text->size, > table_start, table_size, > NULL, 0); Shorter naming allows you to reduce the number of lines of the above function call: init_unwind_table(table, module->name, core_text->base, core_text->size, init_text->base, init_text->size, table_start, table_size, NULL, 0); > diff --git a/kernel/module/main.c b/kernel/module/main.c > index d3be89de706d..b9617d7e8db2 100644 > --- a/kernel/module/main.c > +++ b/kernel/module/main.c > @@ -80,12 +80,6 @@ struct mod_tree_root mod_tree __cacheline_aligned = { > .addr_min = -1UL, > }; > > -#ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC > -struct mod_tree_root mod_data_tree __cacheline_aligned = { > - .addr_min = -1UL, > -}; > -#endif > - > struct symsearch { > const struct kernel_symbol *start, *stop; > const s32 *crcs; > @@ -93,14 +87,24 @@ struct symsearch { > }; > > /* > - * Bounds of module text, for speeding up __module_address. > + * Bounds of module memory, for speeding up __module_address. > * Protected by module_mutex. > */ > -static void __mod_update_bounds(void *base, unsigned int size, struct mod_tree_root *tree) > +static void __mod_update_bounds(enum mod_mem_type type __maybe_unused, void *base, > + unsigned int size, struct mod_tree_root *tree) > { > unsigned long min = (unsigned long)base; > unsigned long max = min + size; > > +#ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC A #ifdef shouldn't be required. You can use IS_ENABLED() instead: > + if (mod_mem_type_is_core_data(type)) { if (IS_ENABLED(CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC) && mod_mem_type_is_core_data(type)) > + if (min < tree->data_addr_min) > + tree->data_addr_min = min; > + if (max > tree->data_addr_max) > + tree->data_addr_max = max; > + return; > + } > +#endif > if (min < tree->addr_min) > tree->addr_min = min; > if (max > tree->addr_max) > @@ -109,12 +113,12 @@ static void __mod_update_bounds(void *base, unsigned int size, struct mod_tree_r > > static void mod_update_bounds(struct module *mod) > { > - __mod_update_bounds(mod->core_layout.base, mod->core_layout.size, &mod_tree); > - if (mod->init_layout.size) > - __mod_update_bounds(mod->init_layout.base, mod->init_layout.size, &mod_tree); > -#ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC > - __mod_update_bounds(mod->data_layout.base, mod->data_layout.size, &mod_data_tree); > -#endif > + for_each_mod_mem_type(type) { > + struct module_memory *mod_mem = &mod->mem[type]; > + > + if (mod_mem->size) > + __mod_update_bounds(type, mod_mem->base, mod_mem->size, &mod_tree); > + } > } > > /* Block module loading/unloading? */ > @@ -923,12 +927,27 @@ static ssize_t store_uevent(struct module_attribute *mattr, > struct module_attribute module_uevent = > __ATTR(uevent, 0200, NULL, store_uevent); > > +#ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC It should be possible to have only one show_coresize() function, with IS_ENABLED(CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC) ; see https://docs.kernel.org/process/coding-style.html#conditional-compilation > + > static ssize_t show_coresize(struct module_attribute *mattr, > struct module_kobject *mk, char *buffer) > { > - return sprintf(buffer, "%u\n", mk->mod->core_layout.size); > + return sprintf(buffer, "%u\n", mk->mod->mem[MOD_TEXT].size); > } > > +#else > + > +static ssize_t show_coresize(struct module_attribute *mattr, > + struct module_kobject *mk, char *buffer) > +{ > + unsigned int size = 0; > + > + for_class_mod_mem_type(type, core) > + size += mk->mod->mem[type].size; > + return sprintf(buffer, "%u\n", size); > +} > +#endif > + > static struct module_attribute modinfo_coresize = > __ATTR(coresize, 0444, show_coresize, NULL); > > @@ -936,7 +955,11 @@ static struct module_attribute modinfo_coresize = > static ssize_t show_datasize(struct module_attribute *mattr, > struct module_kobject *mk, char *buffer) > { > - return sprintf(buffer, "%u\n", mk->mod->data_layout.size); > + unsigned int size = 0; > + > + for_class_mod_mem_type(type, core_data) > + size += mk->mod->mem[type].size; > + return sprintf(buffer, "%u\n", size); > } > > static struct module_attribute modinfo_datasize = > @@ -946,7 +969,11 @@ static struct module_attribute modinfo_datasize = > static ssize_t show_initsize(struct module_attribute *mattr, > struct module_kobject *mk, char *buffer) > { > - return sprintf(buffer, "%u\n", mk->mod->init_layout.size); > + unsigned int size = 0; > + > + for_class_mod_mem_type(type, init) > + size += mk->mod->mem[type].size; > + return sprintf(buffer, "%u\n", size); > } > > static struct module_attribute modinfo_initsize = > @@ -1143,6 +1170,65 @@ void __weak module_arch_freeing_init(struct module *mod) > { > } > > +#ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC Same, could simply be return IS_ENABLED(CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC) && mod_mem_type_is_core_data(type); > +static bool mod_mem_use_vmalloc(enum mod_mem_type type) > +{ > + return mod_mem_type_is_core_data(type); > +} > +#else > +static bool mod_mem_use_vmalloc(enum mod_mem_type type) > +{ > + return false; > +} > +#endif > + > +static void *module_memory_alloc(unsigned int size, enum mod_mem_type type) > +{ > + if (mod_mem_use_vmalloc(type)) > + return vzalloc(size); > + return module_alloc(size); > +} > + > +static void module_memory_free(void *ptr, enum mod_mem_type type) > +{ > + if (mod_mem_use_vmalloc(type)) > + vfree(ptr); > + else > + module_memfree(ptr); > +} > + > +static void free_mod_mem(struct module *mod) > +{ > + /* free the memory in the right order to avoid use-after-free */ Instead of 'right order', explain what the right order is. As far as I understand it is only to free MOD_DATA last. Everything else doesn't matter. > + static enum mod_mem_type mod_mem_free_order[MOD_MEM_NUM_TYPES] = { > + /* first free init sections */ > + MOD_INIT_TEXT, > + MOD_INIT_DATA, > + MOD_INIT_RODATA, > + > + /* then core sections, except rw data */ > + MOD_TEXT, > + MOD_RODATA, > + MOD_RO_AFTER_INIT, > + > + /* last, rw data */ > + MOD_DATA, > + }; > + int i; > + > + BUILD_BUG_ON(ARRAY_SIZE(mod_mem_free_order) != MOD_MEM_NUM_TYPES); > + > + for (i = 0; i < MOD_MEM_NUM_TYPES; i++) { > + enum mod_mem_type type = mod_mem_free_order[i]; > + struct module_memory *mod_mem = &mod->mem[type]; > + > + /* Free lock-classes; relies on the preceding sync_rcu(). */ > + lockdep_free_key_range(mod_mem->base, mod_mem->size); > + if (mod_mem->size) > + module_memory_free(mod_mem->base, type); > + } > +} > + > /* Free a module, remove from lists, etc. */ > static void free_module(struct module *mod) > { > @@ -1428,84 +1504,63 @@ static void layout_sections(struct module *mod, struct load_info *info) > { SHF_WRITE | SHF_ALLOC, ARCH_SHF_SMALL }, > { ARCH_SHF_SMALL | SHF_ALLOC, 0 } > }; > - unsigned int m, i; > - > - for (i = 0; i < info->hdr->e_shnum; i++) > - info->sechdrs[i].sh_entsize = ~0UL; > + static const int core_m_to_mem_type[] = { > + MOD_TEXT, > + MOD_RODATA, > + MOD_RO_AFTER_INIT, > + MOD_DATA, > + MOD_INVALID, /* This is needed to match the masks array */ > + }; > + static const int init_m_to_mem_type[] = { > + MOD_INIT_TEXT, > + MOD_INIT_RODATA, > + MOD_INVALID, > + MOD_INIT_DATA, > + MOD_INVALID, /* This is needed to match the masks array */ > + }; > > - pr_debug("Core section allocation order:\n"); > for (m = 0; m < ARRAY_SIZE(masks); ++m) { > + enum mod_mem_type type = is_init ? init_m_to_mem_type[m] : core_m_to_mem_type[m]; > + > for (i = 0; i < info->hdr->e_shnum; ++i) { > Elf_Shdr *s = &info->sechdrs[i]; > const char *sname = info->secstrings + s->sh_name; > - unsigned int *sizep; > > if ((s->sh_flags & masks[m][0]) != masks[m][0] > || (s->sh_flags & masks[m][1]) > || s->sh_entsize != ~0UL > - || module_init_layout_section(sname)) > + || is_init != module_init_layout_section(sname)) > continue; > - sizep = m ? &mod->data_layout.size : &mod->core_layout.size; > - s->sh_entsize = module_get_offset(mod, sizep, s, i); > - pr_debug("\t%s\n", sname); > - } > - switch (m) { > - case 0: /* executable */ > - mod->core_layout.size = strict_align(mod->core_layout.size); Where is the strict alignment done now ? > - mod->core_layout.text_size = mod->core_layout.size; > - break; > - case 1: /* RO: text and ro-data */ > - mod->data_layout.size = strict_align(mod->data_layout.size); > - mod->data_layout.ro_size = mod->data_layout.size; > - break; > - case 2: /* RO after init */ > - mod->data_layout.size = strict_align(mod->data_layout.size); > - mod->data_layout.ro_after_init_size = mod->data_layout.size; > - break; > - case 4: /* whole core */ > - mod->data_layout.size = strict_align(mod->data_layout.size); > - break; > - } > - } > > - pr_debug("Init section allocation order:\n"); > - for (m = 0; m < ARRAY_SIZE(masks); ++m) { > - for (i = 0; i < info->hdr->e_shnum; ++i) { > - Elf_Shdr *s = &info->sechdrs[i]; > - const char *sname = info->secstrings + s->sh_name; > - > - if ((s->sh_flags & masks[m][0]) != masks[m][0] > - || (s->sh_flags & masks[m][1]) > - || s->sh_entsize != ~0UL > - || !module_init_layout_section(sname)) > + if (WARN_ON_ONCE(type == MOD_INVALID)) > continue; > - s->sh_entsize = (module_get_offset(mod, &mod->init_layout.size, s, i) > - | INIT_OFFSET_MASK); > + > + s->sh_entsize = module_get_offset_and_type(mod, type, s, i); > pr_debug("\t%s\n", sname); > } > - switch (m) { > - case 0: /* executable */ > - mod->init_layout.size = strict_align(mod->init_layout.size); > - mod->init_layout.text_size = mod->init_layout.size; > - break; > - case 1: /* RO: text and ro-data */ > - mod->init_layout.size = strict_align(mod->init_layout.size); > - mod->init_layout.ro_size = mod->init_layout.size; > - break; > - case 2: > - /* > - * RO after init doesn't apply to init_layout (only > - * core_layout), so it just takes the value of ro_size. > - */ > - mod->init_layout.ro_after_init_size = mod->init_layout.ro_size; > - break; > - case 4: /* whole init */ > - mod->init_layout.size = strict_align(mod->init_layout.size); > - break; > - } > } > } > > +/* > + * Lay out the SHF_ALLOC sections in a way not dissimilar to how ld > + * might -- code, read-only data, read-write data, small data. Tally > + * sizes, and place the offsets into sh_entsize fields: high bit means it > + * belongs in init. > + */ > +static void layout_sections(struct module *mod, struct load_info *info) > +{ > + unsigned int i; > + > + for (i = 0; i < info->hdr->e_shnum; i++) > + info->sechdrs[i].sh_entsize = ~0UL; > + > + pr_debug("Core section allocation order:\n"); > + __layout_sections(mod, info, false); > + > + pr_debug("Init section allocation order:\n"); > + __layout_sections(mod, info, true); > +} > + > static void set_license(struct module *mod, const char *license) > { > if (!license) > @@ -2122,72 +2177,42 @@ static int move_module(struct module *mod, struct load_info *info) > { > int i; > void *ptr; > + enum mod_mem_type t; > > - /* Do the allocs. */ > - ptr = module_alloc(mod->core_layout.size); > - /* > - * The pointer to this block is stored in the module structure > - * which is inside the block. Just mark it as not being a > - * leak. > - */ > - kmemleak_not_leak(ptr); > - if (!ptr) > - return -ENOMEM; > - > - memset(ptr, 0, mod->core_layout.size); > - mod->core_layout.base = ptr; > + for_each_mod_mem_type(type) { > + if (!mod->mem[type].size) { > + mod->mem[type].base = NULL; > + continue; > + } > + mod->mem[type].size = PAGE_ALIGN(mod->mem[type].size); > + ptr = module_memory_alloc(mod->mem[type].size, type); > > - if (mod->init_layout.size) { > - ptr = module_alloc(mod->init_layout.size); > /* > * The pointer to this block is stored in the module structure > - * which is inside the block. This block doesn't need to be > - * scanned as it contains data and code that will be freed > - * after the module is initialized. > + * which is inside the block. Just mark it as not being a > + * leak. > */ > kmemleak_ignore(ptr); > if (!ptr) { > - module_memfree(mod->core_layout.base); > - return -ENOMEM; > + t = type; > + goto out_enomem; > } > - memset(ptr, 0, mod->init_layout.size); > - mod->init_layout.base = ptr; > - } else > - mod->init_layout.base = NULL; > - > -#ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC > - /* Do the allocs. */ > - ptr = vzalloc(mod->data_layout.size); > - /* > - * The pointer to this block is stored in the module structure > - * which is inside the block. Just mark it as not being a > - * leak. > - */ > - kmemleak_not_leak(ptr); > - if (!ptr) { > - module_memfree(mod->core_layout.base); > - module_memfree(mod->init_layout.base); > - return -ENOMEM; > + memset(ptr, 0, mod->mem[type].size); > + mod->mem[type].base = ptr; > } > > - mod->data_layout.base = ptr; > -#endif > /* Transfer each section which specifies SHF_ALLOC */ > pr_debug("final section addresses:\n"); > for (i = 0; i < info->hdr->e_shnum; i++) { > void *dest; > Elf_Shdr *shdr = &info->sechdrs[i]; > + enum mod_mem_type type = shdr->sh_entsize >> SH_ENTSIZE_TYPE_SHIFT; > > if (!(shdr->sh_flags & SHF_ALLOC)) > continue; > > - if (shdr->sh_entsize & INIT_OFFSET_MASK) > - dest = mod->init_layout.base > - + (shdr->sh_entsize & ~INIT_OFFSET_MASK); > - else if (!(shdr->sh_flags & SHF_EXECINSTR)) > - dest = mod->data_layout.base + shdr->sh_entsize; > - else > - dest = mod->core_layout.base + shdr->sh_entsize; > + dest = mod->mem[type].base + > + (shdr->sh_entsize & SH_ENTSIZE_OFFSET_MASK); Can't that fit on a single line ? > > if (shdr->sh_type != SHT_NOBITS) > memcpy(dest, (void *)shdr->sh_addr, shdr->sh_size); > @@ -3060,20 +3091,21 @@ bool is_module_address(unsigned long addr) > struct module *__module_address(unsigned long addr) > { > struct module *mod; > - struct mod_tree_root *tree; > > if (addr >= mod_tree.addr_min && addr <= mod_tree.addr_max) > - tree = &mod_tree; > + goto lookup; > + > #ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC Can we try to avoid that #ifdef ? I know that means getting data_addr_min and data_addr_max alwyas existing, maybe through an unnamed union or a macro or a static inline helper ? > - else if (addr >= mod_data_tree.addr_min && addr <= mod_data_tree.addr_max) > - tree = &mod_data_tree; > + if (addr >= mod_tree.data_addr_min && addr <= mod_tree.data_addr_max) > + goto lookup; > #endif > - else > - return NULL; > > + return NULL; > + > +lookup: > module_assert_mutex_or_preempt(); > > - mod = mod_find(addr, tree); > + mod = mod_find(addr, &mod_tree); > if (mod) { > BUG_ON(!within_module(addr, mod)); > if (mod->state == MODULE_STATE_UNFORMED)