On 4/14/23 07:08, Luis Chamberlain wrote: > Loading modules with finit_module() can end up using vmalloc(), vmap() > and vmalloc() again, for a total of up to 3 separate allocations in the > worst case for a single module. We always kernel_read*() the module, > that's a vmalloc(). Then vmap() is used for the module decompression, > and if so the last read buffer is freed as we use the now decompressed > module buffer to stuff data into our copy module. The last allocation is > specific to each architectures but pretty much that's generally a series > of vmalloc() calls or a variation of vmalloc to handle ELF sections with > special permissions. > > Evaluation with new stress-ng module support [1] with just 100 ops > is proving that you can end up using GiBs of data easily even with all > care we have in the kernel and userspace today in trying to not load modules > which are already loaded. 100 ops seems to resemble the sort of pressure a > system with about 400 CPUs can create on module loading. Although issues > relating to duplicate module requests due to each CPU inucurring a new > module reuest is silly and some of these are being fixed, we currently lack > proper tooling to help diagnose easily what happened, when it happened > and who likely is to blame -- userspace or kernel module autoloading. > > Provide an initial set of stats which use debugfs to let us easily scrape > post-boot information about failed loads. This sort of information can > be used on production worklaods to try to optimize *avoiding* redundant > memory pressure using finit_module(). This looks useful, thanks for it. Some comments below. > > There's a few examples that can be provided: > > A 255 vCPU system without the next patch in this series applied: > > Startup finished in 19.143s (kernel) + 7.078s (userspace) = 26.221s > graphical.target reached after 6.988s in userspace > > And 13.58 GiB of virtual memory space lost due to failed module loading: > > root@big ~ # cat /sys/kernel/debug/modules/stats > Mods ever loaded 67 > Mods failed on kread 0 > Mods failed on decompress 0 > Mods failed on becoming 0 > Mods failed on load 1411 > Total module size 11464704 > Total mod text size 4194304 > Failed kread bytes 0 > Failed decompress bytes 0 > Failed becoming bytes 0 > Failed kmod bytes 14588526272 > Virtual mem wasted bytes 14588526272 > Average mod size 171115 > Average mod text size 62602 > Average fail load bytes 10339140 > Duplicate failed modules: > module-name How-many-times Reason > kvm_intel 249 Load > kvm 249 Load > irqbypass 8 Load > crct10dif_pclmul 128 Load > ghash_clmulni_intel 27 Load > sha512_ssse3 50 Load > sha512_generic 200 Load > aesni_intel 249 Load > crypto_simd 41 Load > cryptd 131 Load > evdev 2 Load > serio_raw 1 Load > virtio_pci 3 Load > nvme 3 Load > nvme_core 3 Load > virtio_pci_legacy_dev 3 Load > virtio_pci_modern_dev 3 Load > t10_pi 3 Load > virtio 3 Load > crc32_pclmul 6 Load > crc64_rocksoft 3 Load > crc32c_intel 40 Load > virtio_ring 3 Load > crc64 3 Load > > The following screen shot, of a simple 8vcpu 8 GiB KVM guest with the > next patch in this series applied, shows 226.53 MiB are wasted in virtual > memory allocations which due to duplicate module requests during boot. > It also shows an average module memory size of 167.10 KiB and an an > average module .text + .init.text size of 61.13 KiB. 
The end shows all > modules which were detected as duplicate requests and whether or not > they failed early after just the first kernel_read*() call or late after > we've already allocated the private space for the module in > layout_and_allocate(). A system with module decompression would reveal > more wasted virtual memory space. > > We should put effort now into identifying the source of these duplicate > module requests and trimming these down as much possible. Larger systems > will obviously show much more wasted virtual memory allocations. > > root@kmod ~ # cat /sys/kernel/debug/modules/stats > Mods ever loaded 67 > Mods failed on kread 0 > Mods failed on decompress 0 > Mods failed on becoming 83 > Mods failed on load 16 > Total module size 11464704 > Total mod text size 4194304 > Failed kread bytes 0 > Failed decompress bytes 0 > Failed becoming bytes 228959096 > Failed kmod bytes 8578080 > Virtual mem wasted bytes 237537176 > Average mod size 171115 > Average mod text size 62602 > Avg fail becoming bytes 2758544 > Average fail load bytes 536130 > Duplicate failed modules: > module-name How-many-times Reason > kvm_intel 7 Becoming > kvm 7 Becoming > irqbypass 6 Becoming & Load > crct10dif_pclmul 7 Becoming & Load > ghash_clmulni_intel 7 Becoming & Load > sha512_ssse3 6 Becoming & Load > sha512_generic 7 Becoming & Load > aesni_intel 7 Becoming > crypto_simd 7 Becoming & Load > cryptd 3 Becoming & Load > evdev 1 Becoming > serio_raw 1 Becoming > nvme 3 Becoming > nvme_core 3 Becoming > t10_pi 3 Becoming > virtio_pci 3 Becoming > crc32_pclmul 6 Becoming & Load > crc64_rocksoft 3 Becoming > crc32c_intel 3 Becoming > virtio_pci_modern_dev 2 Becoming > virtio_pci_legacy_dev 1 Becoming > crc64 2 Becoming > virtio 2 Becoming > virtio_ring 2 Becoming > > [0] https://github.com/ColinIanKing/stress-ng.git > [1] echo 0 > /proc/sys/vm/oom_dump_tasks > ./stress-ng --module 100 --module-name xfs > > Signed-off-by: Luis Chamberlain <mcgrof@xxxxxxxxxx> > --- > Documentation/core-api/kernel-api.rst | 22 +- > kernel/module/Kconfig | 37 +++ > kernel/module/Makefile | 1 + > kernel/module/decompress.c | 4 + > kernel/module/internal.h | 74 +++++ > kernel/module/main.c | 65 +++- > kernel/module/stats.c | 432 ++++++++++++++++++++++++++ > kernel/module/tracking.c | 7 +- > 8 files changed, 630 insertions(+), 12 deletions(-) > create mode 100644 kernel/module/stats.c > > diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst > index e27728596008..9b3f3e5f5a95 100644 > --- a/Documentation/core-api/kernel-api.rst > +++ b/Documentation/core-api/kernel-api.rst > @@ -220,12 +220,30 @@ relay interface > Module Support > ============== > > -Module Loading > --------------- > +Kernel module auto-loading > +-------------------------- > > .. kernel-doc:: kernel/module/kmod.c > :export: > > +Module debugging > +---------------- > + > +.. kernel-doc:: kernel/module/stats.c > + :doc: module debugging statistics overview > + > +dup_failed_modules - tracks duplicate failed modules > +**************************************************** > + > +.. kernel-doc:: kernel/module/stats.c > + :doc: dup_failed_modules - tracks duplicate failed modules > + > +module statistics debugfs counters > +********************************** > + > +.. 
kernel-doc:: kernel/module/stats.c > + :doc: module statistics debugfs counters > + > Inter Module support > -------------------- > > diff --git a/kernel/module/Kconfig b/kernel/module/Kconfig > index 424b3bc58f3f..ca277b945a67 100644 > --- a/kernel/module/Kconfig > +++ b/kernel/module/Kconfig > @@ -22,6 +22,43 @@ menuconfig MODULES > > if MODULES > > +config MODULE_DEBUG > + bool "Module debugging" > + depends on DEBUG_FS > + help > + Allows you to enable / disable features which can help you debug > + modules. You don't need these options in produciton systems. You can > + and probably should enable this prior to making your kernel > + produciton ready though. 2x typo: produciton -> production. The last sentence could be misinterpreted to mean that you should enable this to make your kernel production ready. Not sure, maybe I would just drop this sentence. Note that there are plenty of other typos in the added comments and documentation. Please review them with a spell checker. > + > +if MODULE_DEBUG > + > +config MODULE_STATS > + bool "Module statistics" > + depends on DEBUG_FS > + help > + This option allows you to maintain a record of module statistics. > + For example each all modules size, average size, text size, and > + failed modules and the size for each of those. For failed This sentence doesn't quite make sense. I guess it should say something as: For example, size of all modules, average size, text size, a list of failed modules and the size for each of those. > + modules we keep track of module which failed due to either the > + existing module taking too long to load or that module already > + was loaded. > + > + You should enable this if you are debugging production loads > + and want to see if userspace or the kernel is doing stupid things > + with loading modules when it shouldn't or if you want to help > + optimize userspace / kernel space module autoloading schemes. > + You might want to do this because failed modules tend to use > + use up significan amount of memory, and so you'd be doing everyone Word 'use' is repeated twice. > + a favor in avoiding these failure proactively. > + > + This functionality is also useful for those experimenting with > + module .text ELF section optimization. > + > + If unsure, say N. > + > +endif # MODULE_DEBUG > + > config MODULE_FORCE_LOAD > bool "Forced module loading" > default n > diff --git a/kernel/module/Makefile b/kernel/module/Makefile > index 5b1d26b53b8d..52340bce497e 100644 > --- a/kernel/module/Makefile > +++ b/kernel/module/Makefile > @@ -21,3 +21,4 @@ obj-$(CONFIG_SYSFS) += sysfs.o > obj-$(CONFIG_KGDB_KDB) += kdb.o > obj-$(CONFIG_MODVERSIONS) += version.o > obj-$(CONFIG_MODULE_UNLOAD_TAINT_TRACKING) += tracking.o > +obj-$(CONFIG_MODULE_STATS) += stats.o > diff --git a/kernel/module/decompress.c b/kernel/module/decompress.c > index 7ddc87bee274..e97232b125eb 100644 > --- a/kernel/module/decompress.c > +++ b/kernel/module/decompress.c > @@ -297,6 +297,10 @@ int module_decompress(struct load_info *info, const void *buf, size_t size) > ssize_t data_size; > int error; > > +#if defined(CONFIG_MODULE_STATS) > + info->compressed_len = size; > +#endif > + > /* > * Start with number of pages twice as big as needed for > * compressed data. 
> diff --git a/kernel/module/internal.h b/kernel/module/internal.h > index 6ae29bb8836f..9d97a59a9127 100644 > --- a/kernel/module/internal.h > +++ b/kernel/module/internal.h > @@ -59,6 +59,9 @@ struct load_info { > unsigned long mod_kallsyms_init_off; > #endif > #ifdef CONFIG_MODULE_DECOMPRESS > +#ifdef CONFIG_MODULE_STATS > + unsigned long compressed_len; > +#endif > struct page **pages; > unsigned int max_pages; > unsigned int used_pages; > @@ -143,6 +146,77 @@ static inline bool set_livepatch_module(struct module *mod) > #endif > } > > +/** > + * enum fail_dup_mod_reason - state at which a duplicate module was detected > + * > + * @FAIL_DUP_MOD_BECOMING: the module is read properly, passes all checks but > + * we've determined that another module with the same name is already loaded > + * or being processed on our &modules list. This happens on early_mod_check() > + * right before layout_and_allocate(). The kernel would have already > + * vmalloc()'d space for the entire module through finit_module(). If > + * decompression was used two vmap() spaces were used. These failures can > + * happen when userspace has not seen the module present on the kernel and > + * tries to load the module multiple times at same time. > + * @FAIL_DUP_MOD_LOAD: the module has been read properly, passes all validation > + * checks and the kernel determines that the module was unique and because > + * of this allocated yet another private kernel copy of the module space in > + * layout_and_allocate() but after this determined in add_unformed_module() > + * that another module with the same name is already loaded or being processed. > + * These failures should be mitigated as much as possible and are indicative > + * of really fast races in loading modules. Without module decompression > + * they waste twice as much vmap space. With module decompression three > + * times the module's size vmap space is wasted. 
> + */ > +enum fail_dup_mod_reason { > + FAIL_DUP_MOD_BECOMING = 0, > + FAIL_DUP_MOD_LOAD, > +}; > + > +#ifdef CONFIG_MODULE_STATS > + > +#define mod_stat_add_long(count, var) atomic_long_add(count, var) > +#define mod_stat_inc(name) atomic_inc(name) > + > +extern atomic_long_t total_mod_size; > +extern atomic_long_t total_text_size; > +extern atomic_long_t invalid_kread_bytes; > +extern atomic_long_t invalid_decompress_bytes; > + > +extern atomic_t modcount; > +extern atomic_t failed_kreads; > +extern atomic_t failed_decompress; > +struct mod_fail_load { > + struct list_head list; > + char name[MODULE_NAME_LEN]; > + atomic_long_t count; > + unsigned long dup_fail_mask; > +}; > + > +int try_add_failed_module(const char *name, size_t len, enum fail_dup_mod_reason reason); > +void mod_stat_bump_invalid(struct load_info *info, int flags); > +void mod_stat_bump_becoming(struct load_info *info, int flags); > + > +#else > + > +#define mod_stat_add_long(name, var) > +#define mod_stat_inc(name) > + > +static inline int try_add_failed_module(const char *name, size_t len, > + enum fail_dup_mod_reason reason) > +{ > + return 0; > +} > + > +static inline void mod_stat_bump_invalid(struct load_info *info, int flags) > +{ > +} > + > +static inline void mod_stat_bump_becoming(struct load_info *info, int flags) > +{ > +} > + > +#endif /* CONFIG_MODULE_STATS */ > + > #ifdef CONFIG_MODULE_UNLOAD_TAINT_TRACKING > struct mod_unload_taint { > struct list_head list; > diff --git a/kernel/module/main.c b/kernel/module/main.c > index 75b23257128d..5642d77657a0 100644 > --- a/kernel/module/main.c > +++ b/kernel/module/main.c > @@ -56,6 +56,7 @@ > #include <linux/dynamic_debug.h> > #include <linux/audit.h> > #include <linux/cfi.h> > +#include <linux/debugfs.h> > #include <uapi/linux/module.h> > #include "internal.h" > > @@ -87,6 +88,8 @@ struct symsearch { > enum mod_license license; > }; > > +struct dentry *mod_debugfs_root; > + > /* > * Bounds of module memory, for speeding up __module_address. > * Protected by module_mutex. > @@ -2500,6 +2503,18 @@ static noinline int do_init_module(struct module *mod) > { > int ret = 0; > struct mod_initfree *freeinit; > +#if defined(CONFIG_MODULE_STATS) > + unsigned int text_size = 0, total_size = 0; > + > + for_each_mod_mem_type(type) { > + const struct module_memory *mod_mem = &mod->mem[type]; > + if (mod_mem->size) { > + total_size += mod_mem->size; > + if (type == MOD_TEXT || type == MOD_INIT_TEXT) > + text_size += mod->mem[type].size; 'text_size += mod_mem->size;' would be simpler. 
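E.g. the whole loop could then read something like this (untested sketch, only restating the suggestion; names are the ones already used in the hunk above):

	for_each_mod_mem_type(type) {
		const struct module_memory *mod_mem = &mod->mem[type];

		if (!mod_mem->size)
			continue;

		/* mod_mem already points at mod->mem[type] */
		total_size += mod_mem->size;
		if (type == MOD_TEXT || type == MOD_INIT_TEXT)
			text_size += mod_mem->size;
	}
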
> + } > + } > +#endif > > freeinit = kmalloc(sizeof(*freeinit), GFP_KERNEL); > if (!freeinit) { > @@ -2561,6 +2576,7 @@ static noinline int do_init_module(struct module *mod) > mod->mem[type].base = NULL; > mod->mem[type].size = 0; > } > + > #ifdef CONFIG_DEBUG_INFO_BTF_MODULES > /* .BTF is not SHF_ALLOC and will get removed, so sanitize pointer */ > mod->btf_data = NULL; > @@ -2584,6 +2600,11 @@ static noinline int do_init_module(struct module *mod) > mutex_unlock(&module_mutex); > wake_up_all(&module_wq); > > + mod_stat_add_long(text_size, &total_text_size); > + mod_stat_add_long(total_size, &total_mod_size); > + > + mod_stat_inc(&modcount); > + > return 0; > > fail_free_freeinit: > @@ -2599,6 +2620,7 @@ static noinline int do_init_module(struct module *mod) > ftrace_release_mod(mod); > free_module(mod); > wake_up_all(&module_wq); > + > return ret; > } > > @@ -2632,7 +2654,8 @@ static bool finished_loading(const char *name) > } > > /* Must be called with module_mutex held */ > -static int module_patient_check_exists(const char *name) > +static int module_patient_check_exists(const char *name, > + enum fail_dup_mod_reason reason) > { > struct module *old; > int err = 0; > @@ -2655,6 +2678,9 @@ static int module_patient_check_exists(const char *name) > old = find_module_all(name, strlen(name), true); > } > > + if (try_add_failed_module(name, strlen(name), reason)) > + pr_warn("Could not add fail-tracking for module: %s\n", name); > + > /* > * We are here only when the same module was being loaded. Do > * not try to load it again right now. It prevents long delays > @@ -2679,7 +2705,7 @@ static int add_unformed_module(struct module *mod) > mod->state = MODULE_STATE_UNFORMED; > > mutex_lock(&module_mutex); > - err = module_patient_check_exists(mod->name); > + err = module_patient_check_exists(mod->name, FAIL_DUP_MOD_LOAD); > if (err) > goto out; > > @@ -2800,6 +2826,7 @@ static int load_module(struct load_info *info, const char __user *uargs, > int flags) > { > struct module *mod; > + bool module_allocated = false; > long err = 0; > char *after_dashes; > > @@ -2839,6 +2866,8 @@ static int load_module(struct load_info *info, const char __user *uargs, > goto free_copy; > } > > + module_allocated = true; > + > audit_log_kern_module(mod->name); > > /* Reserve our place in the list. */ > @@ -2983,6 +3012,7 @@ static int load_module(struct load_info *info, const char __user *uargs, > synchronize_rcu(); > mutex_unlock(&module_mutex); > free_module: > + mod_stat_bump_invalid(info, flags); > /* Free lock-classes; relies on the preceding sync_rcu() */ > for_class_mod_mem_type(type, core_data) { > lockdep_free_key_range(mod->mem[type].base, > @@ -2991,6 +3021,13 @@ static int load_module(struct load_info *info, const char __user *uargs, > > module_deallocate(mod, info); > free_copy: > + /* > + * The info->len is always set. We distinguish between > + * failures once the proper module was allocated and > + * before that. 
> + */ > + if (!module_allocated) > + mod_stat_bump_becoming(info, flags); > free_copy(info, flags); > return err; > } > @@ -3009,8 +3046,11 @@ SYSCALL_DEFINE3(init_module, void __user *, umod, > umod, len, uargs); > > err = copy_module_from_user(umod, len, &info); > - if (err) > + if (err) { > + mod_stat_inc(&failed_kreads); > + mod_stat_add_long(len, &invalid_kread_bytes); > return err; > + } > > return load_module(&info, uargs, 0); > } > @@ -3035,14 +3075,20 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags) > > len = kernel_read_file_from_fd(fd, 0, &buf, INT_MAX, NULL, > READING_MODULE); > - if (len < 0) > + if (len < 0) { > + mod_stat_inc(&failed_kreads); > + mod_stat_add_long(len, &invalid_kread_bytes); > return len; > + } > > if (flags & MODULE_INIT_COMPRESSED_FILE) { > err = module_decompress(&info, buf, len); > vfree(buf); /* compressed data is no longer needed */ > - if (err) > + if (err) { > + mod_stat_inc(&failed_decompress); > + mod_stat_add_long(len, &invalid_decompress_bytes); > return err; > + } > } else { > info.hdr = buf; > info.len = len; > @@ -3216,3 +3262,12 @@ void print_modules(void) > last_unloaded_module.taints); > pr_cont("\n"); > } > + > +#ifdef CONFIG_MODULE_DEBUG > +static int module_debugfs_init(void) > +{ > + mod_debugfs_root = debugfs_create_dir("modules", NULL); > + return 0; > +} > +module_init(module_debugfs_init); > +#endif > diff --git a/kernel/module/stats.c b/kernel/module/stats.c > new file mode 100644 > index 000000000000..d4b5b2b9e6ad > --- /dev/null > +++ b/kernel/module/stats.c > @@ -0,0 +1,432 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > +/* > + * Debugging module statistics. > + * > + * Copyright (C) 2023 Luis Chamberlain <mcgrof@xxxxxxxxxx> > + */ > + > +#include <linux/module.h> > +#include <linux/string.h> > +#include <linux/printk.h> > +#include <linux/slab.h> > +#include <linux/list.h> > +#include <linux/debugfs.h> > +#include <linux/rculist.h> > +#include <linux/math.h> > + > +#include "internal.h" > + > +/** > + * DOC: module debugging statistics overview > + * > + * Enabling CONFIG_MODULE_STATS enables module debugging statistics which > + * are useful to monitor and root cause memory pressure issues with module > + * loading. These statistics are useful to allow us to improve production > + * workloads. > + * > + * The current module debugging statistics supported help keep track of module > + * loading failures to enable improvements either for kernel module > + * auto-loading usage (request_module()) or interactions with userspace. > + * Statistics are provided to track of all possible failures in the > + * finit_module() path and memory wasted in this process space. Each of the > + * failure counters are associated to a type of module loading failure which > + * is known to incur a certain amount of memory allocation loss. In the worst > + * case loading a module will fail after a 3 step memory allocation process: > + * > + * a) memory allocated with kernel_read_file_from_fd() > + * b) module decompression processes the file read from > + * kernel_read_file_from_fd(), and vmap() is used to map > + * the decompressed module to a new local buffer which represents > + * a copy of the decompressed module passed from userspace. The buffer > + * from kernel_read_file_from_fd() is freed right away. > + * c) layout_and_allocate() allocates space for the final resting > + * place where we would keep the module if it were to be processed > + * successfully. 
> + * > + * If a failure occurs after these three different allocations only one > + * counters will be incremetned with the summation of the lost bytes incurred > + * during this failure. Likewise, if a module loading failed only after step b) > + * a separate counter is used and incremented for the bytes lost during both > + * of those allocations. > + * > + * Virtual memory space can be limited, for example on x86 virtual memory size > + * defaults to 128 MiB. We should strive to limit and avoid wasting virtual > + * memory allocations when possible. These module dubugging statistics help > + * to evaluate how much memory is being wasted on bootup due to module loading > + * failures. > + * > + * All counters are designed to be incremental. Atomic counters are used so to > + * remain simple and avoid delays and deadlocks. > + */ > + > +extern struct dentry *mod_debugfs_root; Files kernel/module/stats.c and kernel/module/tracking.c both add this extern declaration. Can it be moved to kernel/module/internal.h? > + > +/** > + * DOC: dup_failed_modules - tracks duplicate failed modules > + * > + * Linked list of modules which failed to be loaded because an already existing > + * module with the same name was already being processed or already loaded. > + * The finit_module() system call incurs heavy virtual memory allocations. In > + * the worst case an finit_module() system call can end up allocating virtual > + * memory 3 times: > + * > + * 1) kernel_read_file_from_fd() call uses vmalloc() > + * 2) optional module decompression uses vmap() > + * 3) layout_and allocate() can use vzalloc() or an arch specific variation of > + * vmalloc to deal with ELF sections requiring special permissions > + * > + * In practice on a typical boot today most finit_module() calls fail due to > + * the module with the same name already being loaded or about to be processed. > + * All virtual memory allocated to these failed modules will be lost with > + * no functional use. > + * > + * To help with this the dup_failed_modules allows us to track modules which > + * failed to load due to the fact that a module already was loaded or being > + * processed already. There are only two points at which we can fail such > + * calls, we list them below along with the number of virtual memory allocation > + * calls: > + * > + * a) FAIL_DUP_MOD_BECOMING: at the end of early_mod_check() before > + * layout_and_allocate(). This does not yet happen. > + * - with module decompression: 2 virtual memory allocation calls > + * - without module decompression: 1 virtual memory allocation calls > + * b) FAIL_DUP_MOD_LOAD: after layout_and_allocate() on add_unformed_module() > + * - with module decompression 3 virtual memory allocation calls > + * - without module decompression 2 virtual memory allocation calls > + * > + * We should strive to get this list to be as small as possible. If this list > + * is not empty it is a reflection of possible work or optimizations possible > + * either in-kernel or in userspace. 
> + */ > +static LIST_HEAD(dup_failed_modules); > + > +/** > + * DOC: module statistics debugfs counters > + * > + * The total amount of wasted virtual memory allocation space during module > + * loading can be computed by adding the total from the summation: > + * > + * * @invalid_kread_bytes + > + * @invalid_decompress_bytes + > + * @invalid_becoming_bytes + > + * @invalid_mod_bytes > + * > + * The following debugfs counters are available to inspect module loading > + * failures: > + * > + * * total_mod_size: total bytes ever used by all modules we've dealt with on > + * this system > + * * total_text_size: total bytes of the .text and .init.text ELF section > + * sizes we've dealt with on this system > + * * invalid_kread_bytes: bytes wasted in failures which happen due to > + * memory allocations with the initial kernel_read_file_from_fd(). > + * kernel_read_file_from_fd() uses vmalloc() and so these are wasted > + * vmalloc() memory allocations. These should typically not happen unless > + * your system is under memory pressure. > + * * invalid_decompress_bytes: number of bytes wasted due to > + * memory allocations in the module decompression path that use vmap(). > + * These typically should not happen unless your system is under memory > + * presssure. > + * * invalid_becoming_bytes: total number of bytes wasted due to > + * allocations used to read the kernel module userspace wants us to read > + * before we promote it to be processed to be added to our @modules linked > + * list. These failures could in theory happen in if we had a check in between > + * between a successful kernel_read_file_from_fd() call and right before > + * we allocate the our private memory for the module which would be kept if > + * the module is successfully loaded. The most common reason for this failure > + * is when userspace is racing to load a module which it does not yet see > + * loaded. The first module to succeed in add_unformed_module() will add a > + * module to our &modules list and subsequent loads of modules with the > + * same name will error out at the end of early_mod_check(). A check > + * for module_patient_check_exists() at the end of early_mod_check() could be > + * added to prevent duplicate allocations on layout_and_allocate() for > + * modules already being processed. These duplicate failed modules are > + * non-fatal, however they typically are indicative of userspace not seeing > + * a module in userspace loaded yet and unecessarily trying to load a > + * module before the kernel even has a chance to begin to process prior > + * requests. Although duplicate failures can be non-fatal, we should try to > + * reduce vmalloc() pressure proactively, so ideally after boot this will > + * be close to as 0 as possible. If module decompression was used we also > + * add to this counter the cost of the initial kernel_read_file_from_fd() > + * of the compressed module. If module decompression was not used the > + * value represents the total wasted allocations in kernel_read_file_from_fd() > + * calls for these type of failures. 
These failures can occur because: > + * > + * * module_sig_check() - module signature checks > + * * elf_validity_cache_copy() - some ELF validation issue > + * * early_mod_check(): > + * > + * * blacklisting > + * * failed to rewrite section headers > + * * version magic > + * * live patch requirements didn't check out > + * * the module was detected as being already present > + * > + * * invalid_mod_bytes: these are the total number of bytes lost due to > + * failures after we did all the sanity checks of the module which userspace > + * passed to us and after our first check that the module is unique. A > + * module can still fail to load if we detect the module is loaded after we > + * allocate space for it with layout_and_allocate(), we do this check right > + * before processing the module as live and run its initialiation routines. > + * Note that you have a failure of this type it also means the respective > + * kernel_read_file_from_fd() memory space was also wasted, and so we > + * increment this counter with twice the size of the module. Additionally > + * if you used module decompression the size of the compressed module is > + * also added to this counter. > + * > + * * modcount: how many modules we've loaded in our kernel life time > + * * failed_kreads: how many modules failed due to failed kernel_read_file_from_fd() > + * * failed_decompress: how many failed module decompression attempts we've had. > + * These really should not happen unless your compression / decompression > + * might be broken. > + * * failed_becoming: how many modules failed after we kernel_read_file_from_fd() > + * it and before we allocate memory for it with layout_and_allocate(). This > + * counter is never incremented if you manage to validate the module and > + * call layout_and_allocate() for it. > + * * failed_load_modules: how many modules failed once we've allocated our > + * private space for our module using layout_and_allocate(). These failures > + * should hopefully mostly be dealt with already. Races in theory could > + * still exist here, but it would just mean the kernel had started processing > + * two threads concurrently up to early_mod_check() and then one just one > + * thread won. These failures are good signs the kernel or userspace is > + * doing something seriously stupid or that could be improved. We should > + * strive to fix these, but it is perhaps not easy to fix them. > + * A recent example are the modules requests incurred for frequency modules, > + * a separate module request was being issued for each CPU on a system. 
> + */ > + > +atomic_long_t total_mod_size; > +atomic_long_t total_text_size; > +atomic_long_t invalid_kread_bytes; > +atomic_long_t invalid_decompress_bytes; > +static atomic_long_t invalid_becoming_bytes; > +static atomic_long_t invalid_mod_bytes; > +atomic_t modcount; > +atomic_t failed_kreads; > +atomic_t failed_decompress; > +static atomic_t failed_becoming; > +static atomic_t failed_load_modules; > + > +static const char *mod_fail_to_str(struct mod_fail_load *mod_fail) > +{ > + if (test_bit(FAIL_DUP_MOD_BECOMING, &mod_fail->dup_fail_mask) && > + test_bit(FAIL_DUP_MOD_LOAD, &mod_fail->dup_fail_mask)) > + return "Becoming & Load"; > + if (test_bit(FAIL_DUP_MOD_BECOMING, &mod_fail->dup_fail_mask)) > + return "Becoming"; > + if (test_bit(FAIL_DUP_MOD_LOAD, &mod_fail->dup_fail_mask)) > + return "Load"; > + return "Bug-on-stats"; > +} > + > +void mod_stat_bump_invalid(struct load_info *info, int flags) > +{ > + atomic_long_add(info->len * 2, &invalid_mod_bytes); > + atomic_inc(&failed_load_modules); > +#if defined(CONFIG_MODULE_DECOMPRESS) > + if (flags & MODULE_INIT_COMPRESSED_FILE) > + atomic_long_add(info->compressed_len, &invalid_mod_byte); Variable invalid_mod_byte is not declared, should be invalid_mod_bytes. > +#endif > +} > + > +void mod_stat_bump_becoming(struct load_info *info, int flags) > +{ > + atomic_inc(&failed_becoming); > + atomic_long_add(info->len, &invalid_becoming_bytes); > +#if defined(CONFIG_MODULE_DECOMPRESS) > + if (flags & MODULE_INIT_COMPRESSED_FILE) > + atomic_long_add(info->compressed_len, &invalid_becoming_bytes); > +#endif > +} > + > +int try_add_failed_module(const char *name, size_t len, enum fail_dup_mod_reason reason) Function try_add_failed_module() is only called from module_patient_check_exists() which always passes in a NUL-terminated string. The len parameter could be then dropped and the comparison in try_add_failed_module() could simply use strcmp(). > +{ > + struct mod_fail_load *mod_fail; > + > + list_for_each_entry_rcu(mod_fail, &dup_failed_modules, list, > + lockdep_is_held(&module_mutex)) { > + if (strlen(mod_fail->name) == len && !memcmp(mod_fail->name, name, len)) { > + atomic_long_inc(&mod_fail->count); > + __set_bit(reason, &mod_fail->dup_fail_mask); > + goto out; > + } > + } > + > + mod_fail = kzalloc(sizeof(*mod_fail), GFP_KERNEL); > + if (!mod_fail) > + return -ENOMEM; > + memcpy(mod_fail->name, name, len); > + __set_bit(reason, &mod_fail->dup_fail_mask); > + atomic_long_inc(&mod_fail->count); > + list_add_rcu(&mod_fail->list, &dup_failed_modules); > +out: > + return 0; > +} Indentation in try_add_failed_module() uses spaces instead of tabs in a few places. > + > +/* > + * At 64 bytes per module and assuming a 1024 bytes preamble we can fit the > + * 112 module prints within 8k. 
> + * > + * 1024 + (64*112) = 8k > + */ > +#define MAX_PREAMBLE 1024 > +#define MAX_FAILED_MOD_PRINT 112 > +#define MAX_BYTES_PER_MOD 64 > +static ssize_t read_file_mod_stats(struct file *file, char __user *user_buf, > + size_t count, loff_t *ppos) > +{ > + struct mod_fail_load *mod_fail; > + unsigned int len, size, count_failed = 0; > + char *buf; > + u32 live_mod_count, fkreads, fdecompress, fbecoming, floads; > + u64 total_size, text_size, ikread_bytes, ibecoming_bytes, idecompress_bytes, imod_bytes, > + total_virtual_lost; > + > + live_mod_count = atomic_read(&modcount); > + fkreads = atomic_read(&failed_kreads); > + fdecompress = atomic_read(&failed_decompress); > + fbecoming = atomic_read(&failed_becoming); > + floads = atomic_read(&failed_load_modules); > + > + total_size = atomic64_read(&total_mod_size); > + text_size = atomic64_read(&total_text_size); > + ikread_bytes = atomic64_read(&invalid_kread_bytes); > + idecompress_bytes = atomic64_read(&invalid_decompress_bytes); > + ibecoming_bytes = atomic64_read(&invalid_becoming_bytes); > + imod_bytes = atomic64_read(&invalid_mod_bytes); > + > + total_virtual_lost = ikread_bytes + idecompress_bytes + ibecoming_bytes + imod_bytes; > + > + size = MAX_PREAMBLE + min((unsigned int)(floads + fbecoming) * MAX_BYTES_PER_MOD, > + (unsigned int) MAX_FAILED_MOD_PRINT * MAX_BYTES_PER_MOD); Using 'size = MAX_PREAMBLE + min((unsigned int)(floads + fbecoming), (unsigned int)MAX_FAILED_MOD_PRINT) * MAX_BYTES_PER_MOD;' is a bit simpler and avoids any theoretical overflow of '(floads + fbecoming) * MAX_BYTES_PER_MOD'. > + buf = kzalloc(size, GFP_KERNEL); > + if (buf == NULL) > + return -ENOMEM; > + > + /* The beginning of our debug preamble */ > + len = scnprintf(buf + 0, size - len, "%25s\t%u\n", "Mods ever loaded", live_mod_count); > + > + len += scnprintf(buf + len, size - len, "%25s\t%u\n", "Mods failed on kread", fkreads); > + > + len += scnprintf(buf + len, size - len, "%25s\t%u\n", "Mods failed on decompress", > + fdecompress); > + len += scnprintf(buf + len, size - len, "%25s\t%u\n", "Mods failed on becoming", fbecoming); > + > + len += scnprintf(buf + len, size - len, "%25s\t%u\n", "Mods failed on load", floads); > + > + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Total module size", total_size); > + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Total mod text size", text_size); > + > + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Failed kread bytes", ikread_bytes); > + > + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Failed decompress bytes", > + idecompress_bytes); > + > + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Failed becoming bytes", ibecoming_bytes); > + > + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Failed kmod bytes", imod_bytes); > + > + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Virtual mem wasted bytes", total_virtual_lost); > + > + if (live_mod_count && total_size) { > + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Average mod size", > + DIV_ROUND_UP(total_size, live_mod_count)); > + } > + > + if (live_mod_count && text_size) { > + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Average mod text size", > + DIV_ROUND_UP(text_size, live_mod_count)); > + } > + > + /* > + * We use WARN_ON_ONCE() for the counters to ensure we always have parity > + * for keeping tabs on a type of failure with one type of byte counter. > + * The counters for imod_bytes does not increase for fkreads failures > + * for example, and so on. 
> + */ > + > + WARN_ON_ONCE(ikread_bytes && !fkreads); > + if (fkreads && ikread_bytes) { > + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Avg fail kread bytes", > + DIV_ROUND_UP(ikread_bytes, fkreads)); > + } > + > + WARN_ON_ONCE(ibecoming_bytes && !fbecoming); > + if (fbecoming && ibecoming_bytes) { > + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Avg fail becoming bytes", > + DIV_ROUND_UP(ibecoming_bytes, fbecoming)); > + } > + > + WARN_ON_ONCE(idecompress_bytes && !fdecompress); > + if (fdecompress && idecompress_bytes) { > + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Avg fail decomp bytes", > + DIV_ROUND_UP(idecompress_bytes, fdecompress)); > + } > + > + WARN_ON_ONCE(imod_bytes && !floads); > + if (floads && imod_bytes) { > + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Average fail load bytes", > + DIV_ROUND_UP(imod_bytes, floads)); > + } > + > + /* End of our debug preamble header. */ > + > + /* Catch when we've gone beyond our expected preamble */ > + WARN_ON_ONCE(len >= MAX_PREAMBLE); > + > + if (list_empty(&dup_failed_modules)) > + goto out; > + > + len += scnprintf(buf + len, size - len, "Duplicate failed modules:\n"); > + len += scnprintf(buf + len, size - len, "%25s\t%15s\t%25s\n", > + "module-name", "How-many-times", "Reason"); "module-name" -> "Module-name" > + mutex_lock(&module_mutex); > + > + > + list_for_each_entry_rcu(mod_fail, &dup_failed_modules, list) { > + if (WARN_ON_ONCE(++count_failed >= MAX_FAILED_MOD_PRINT)) > + goto out_unlock; > + len += scnprintf(buf + len, size - len, "%25s\t%15llu\t%25s\n", mod_fail->name, > + atomic64_read(&mod_fail->count), mod_fail_to_str(mod_fail)); > + } > +out_unlock: > + mutex_unlock(&module_mutex); > +out: > + kfree(buf); > + return simple_read_from_buffer(user_buf, count, ppos, buf, len); > +} > +#undef MAX_PREAMBLE > +#undef MAX_FAILED_MOD_PRINT > +#undef MAX_BYTES_PER_MOD > + > +static const struct file_operations fops_mod_stats = { > + .read = read_file_mod_stats, > + .open = simple_open, > + .owner = THIS_MODULE, > + .llseek = default_llseek, > +}; > + > +#define mod_debug_add_ulong(name) debugfs_create_ulong(#name, 0400, mod_debugfs_root, (unsigned long *) &name.counter) > +#define mod_debug_add_atomic(name) debugfs_create_atomic_t(#name, 0400, mod_debugfs_root, &name) > +static int __init module_stats_init(void) > +{ > + mod_debug_add_ulong(total_mod_size); > + mod_debug_add_ulong(total_text_size); > + mod_debug_add_ulong(invalid_kread_bytes); > + mod_debug_add_ulong(invalid_decompress_bytes); > + mod_debug_add_ulong(invalid_becoming_bytes); > + mod_debug_add_ulong(invalid_mod_bytes); > + > + mod_debug_add_atomic(modcount); > + mod_debug_add_atomic(failed_kreads); > + mod_debug_add_atomic(failed_decompress); > + mod_debug_add_atomic(failed_becoming); > + mod_debug_add_atomic(failed_load_modules); > + > + debugfs_create_file("stats", 0400, mod_debugfs_root, mod_debugfs_root, &fops_mod_stats); > + > + return 0; > +} > +#undef mod_debug_add_ulong > +#undef mod_debug_add_atomic > +module_init(module_stats_init); Function module_stats_init() requires mod_debugfs_root being initialized which is done in module_debugfs_init(). Both functions are recorded to be called via module_init(). Just to make sure, is their ordering guaranteed in some way? 
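If it is only guaranteed by link order, another option might be to not depend on the initcall ordering at all and create the directory lazily through a small helper, e.g. something along these lines (untested sketch, the helper name is made up; it could sit next to the existing mod_debugfs_root definition in main.c and be declared in internal.h):

	/* Create the "modules" debugfs directory on first use. */
	struct dentry *module_debugfs_dir(void)
	{
		if (!mod_debugfs_root)
			mod_debugfs_root = debugfs_create_dir("modules", NULL);
		return mod_debugfs_root;
	}

Callers such as module_stats_init() and unloaded_tainted_modules_init() would then use module_debugfs_dir() instead of reading mod_debugfs_root directly; built-in initcalls run sequentially, so no extra locking should be needed. It would also sidestep the CONFIG_MODULE_DEBUG dependency I mention for tracking.c below.
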
> diff --git a/kernel/module/tracking.c b/kernel/module/tracking.c > index 26d812e07615..16742d1c630c 100644 > --- a/kernel/module/tracking.c > +++ b/kernel/module/tracking.c > @@ -15,6 +15,7 @@ > #include "internal.h" > > static LIST_HEAD(unloaded_tainted_modules); > +extern struct dentry *mod_debugfs_root; > > int try_add_tainted_module(struct module *mod) > { > @@ -120,12 +121,8 @@ static const struct file_operations unloaded_tainted_modules_fops = { > > static int __init unloaded_tainted_modules_init(void) > { > - struct dentry *dir; > - > - dir = debugfs_create_dir("modules", NULL); > - debugfs_create_file("unloaded_tainted", 0444, dir, NULL, > + debugfs_create_file("unloaded_tainted", 0444, mod_debugfs_root, NULL, > &unloaded_tainted_modules_fops); mod_debugfs_root is initialized in module_debugfs_init() only if CONFIG_MODULE_DEBUG is set. However, my reading is that the CONFIG_MODULE_UNLOAD_TAINT_TRACKING feature is orthogonal to it and doesn't require CONFIG_MODULE_DEBUG, so it looks like this change breaks this tracking? > - > return 0; > } > module_init(unloaded_tainted_modules_init); Cheers, Petr