Re: KASAN-related VMAP allocation errors in debug kernels with many logical CPUs

OK. It is related to module vmap space allocation when a module is
inserted. I wonder why it requires 2.5MB for a module? That seems like
a lot to me.


Indeed. I assume KASAN can go wild when it instruments each and every memory access.
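Just to illustrate what that means (a rough conceptual sketch, not the actual implementation): with generic KASAN, the compiler emits a shadow-memory check in front of more or less every load and store (inline, or as calls to __asan_loadN()/__asan_storeN()), plus redzones around objects and globals, which is a large part of why instrumented module text and data grow so much. For an aligned 8-byte access, the inserted check conceptually looks like:

/*
 * Rough conceptual sketch of the check generic KASAN emits before an
 * aligned 8-byte access -- not the real kernel code. One shadow byte
 * covers an 8-byte granule (KASAN_SHADOW_SCALE_SHIFT == 3).
 */
static __always_inline void kasan_check_8byte_access(unsigned long addr)
{
	s8 shadow = *(s8 *)((addr >> KASAN_SHADOW_SCALE_SHIFT) +
			    KASAN_SHADOW_OFFSET);

	/* a non-zero shadow byte means the granule is (partially) poisoned */
	if (unlikely(shadow))
		kasan_report(addr, 8, false, _RET_IP_);
}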


Really looks like it's only the module vmap space: ~1 GiB of module vmap space ...

If an allocation request for a module is 2.5MB, we can load ~400 modules
into 1GB of address space.
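(Rough math: 1 GiB / 2.5 MiB = 1024 / 2.5 ≈ 409 allocations, so on the order of 400.)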

"lsmod | wc -l"? How many modules your system has?


~71, so not even close to 400.

What I find interesting is that we keep seeing these recurring allocations of similar sizes failing.
I wonder whether user space is able to load the same kernel module concurrently, triggering
a massive number of allocations, with the module loading code only figuring out later that
the module has already been loaded and backing off.

If there is a request to allocate memory, it should succeed unless there
is some error, like being out of space or out of memory.

Yes. But as I found out, we really are out of space, because the module loading code allocates module VMAP space first, before verifying whether the module has already been loaded or is concurrently being loaded.

See below.

[...]

I wrote a small patch to dump the modules address space when an allocation failure occurs:

<snip v6.0>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 83b54beb12fa..88d323310df5 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1580,6 +1580,37 @@ preload_this_cpu_lock(spinlock_t *lock, gfp_t gfp_mask, int node)
  		kmem_cache_free(vmap_area_cachep, va);
  }
+static void
+dump_modules_free_space(unsigned long vstart, unsigned long vend)
+{
+	unsigned long va_start, va_end;
+	unsigned int total = 0;
+	struct vmap_area *va;
+
+	if (vend != MODULES_END)
+		return;
+
+	trace_printk("--- Dump the modules address space: 0x%lx - 0x%lx\n", vstart, vend);
+
+	spin_lock(&free_vmap_area_lock);
+	list_for_each_entry(va, &free_vmap_area_list, list) {
+		va_start = (va->va_start > vstart) ? va->va_start:vstart;
+		va_end = (va->va_end < vend) ? va->va_end:vend;
+
+		if (va_start >= va_end)
+			continue;
+
+		if (va_start >= vstart && va_end <= vend) {
+			trace_printk(" va_free: 0x%lx - 0x%lx size=%lu\n",
+				va_start, va_end, va_end - va_start);
+			total += (va_end - va_start);
+		}
+	}
+
+	spin_unlock(&free_vmap_area_lock);
+	trace_printk("--- Total free: %u ---\n", total);
+}
+
  /*
   * Allocate a region of KVA of the specified size and alignment, within the
   * vstart and vend.
@@ -1663,10 +1694,13 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
  		goto retry;
  	}
-	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit())
+	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
  		pr_warn("vmap allocation for size %lu failed: use vmalloc=<size> to increase size\n",
  			size);
+		dump_modules_free_space(vstart, vend);
+	}
+
  	kmem_cache_free(vmap_area_cachep, va);
  	return ERR_PTR(-EBUSY);
  }

Thanks!

I can spot the same module getting loaded over and over again concurrently from user space, only failing with -EEXIST after all the allocations, once add_unformed_module() realizes that the module is in fact already loaded.
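To illustrate the ordering (heavily simplified outline; the function names are the ones from kernel/module/main.c, but the body below is just a sketch, not the real code):

/*
 * Heavily simplified outline of load_module() in kernel/module/main.c;
 * most steps and error handling omitted.
 */
static int load_module_outline(struct load_info *info, int flags)
{
	struct module *mod;
	int err;

	/* Reserves module VMAP space via module_alloc(). */
	mod = layout_and_allocate(info, flags);
	if (IS_ERR(mod))
		return PTR_ERR(mod);

	/*
	 * Only now do we notice that a module with the same name is
	 * already loaded (or concurrently being loaded) and fail with
	 * -EEXIST ...
	 */
	err = add_unformed_module(mod);
	if (err)
		goto free_module;

	/* ... rest of module loading ... */
	return 0;

free_module:
	/* ... which frees the VMAP space we just reserved. */
	module_deallocate(mod, info);
	return err;
}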

That looks quite inefficient. Here is how often user space tries to load the same module on that system. Note that I print *after* allocating module VMAP space.

# dmesg | grep Loading | cut -d" " -f5 | sort | uniq -c
    896 acpi_cpufreq
      1 acpi_pad
      1 acpi_power_meter
      2 ahci
      1 cdrom
      2 compiled-in
      1 coretemp
     15 crc32c_intel
    307 crc32_pclmul
      1 crc64
      1 crc64_rocksoft
      1 crc64_rocksoft_generic
     12 crct10dif_pclmul
     16 dca
      1 dm_log
      1 dm_mirror
      1 dm_mod
      1 dm_region_hash
      1 drm
      1 drm_kms_helper
      1 drm_shmem_helper
      1 fat
      1 fb_sys_fops
     14 fjes
      1 fuse
    205 ghash_clmulni_intel
      1 i2c_algo_bit
      1 i2c_i801
      1 i2c_smbus
      4 i40e
      4 ib_core
      1 ib_uverbs
      4 ice
    403 intel_cstate
      1 intel_pch_thermal
      1 intel_powerclamp
      1 intel_rapl_common
      1 intel_rapl_msr
    399 intel_uncore
      1 intel_uncore_frequency
      1 intel_uncore_frequency_common
     64 ioatdma
      1 ipmi_devintf
      1 ipmi_msghandler
      1 ipmi_si
      1 ipmi_ssif
      4 irdma
    406 irqbypass
      1 isst_if_common
    165 isst_if_mbox_msr
    300 kvm
    408 kvm_intel
      1 libahci
      2 libata
      1 libcrc32c
    409 libnvdimm
      8 Loading
      1 lpc_ich
      1 megaraid_sas
      1 mei
      1 mei_me
      1 mgag200
      1 nfit
      1 pcspkr
      1 qrtr
    405 rapl
      1 rfkill
      1 sd_mod
      2 sg
    409 skx_edac
      1 sr_mod
      1 syscopyarea
      1 sysfillrect
      1 sysimgblt
      1 t10_pi
      1 uas
      1 usb_storage
      1 vfat
      1 wmi
      1 x86_pkg_temp_thermal
      1 xfs


For each of these loading requests, we'll reserve module VMAP space, and only free it once we realize later that the module was already loaded.

So with a lot of CPUs, we might end up trying to load the same module so often, at the same time, that we actually run out of module VMAP space.

I have a prototype patch that seems to fix this in module loading code.
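Roughly, the idea is to check for an already loaded (or concurrently loading) module before reserving any module VMAP space. Just as a sketch of the direction, not the actual patch, and with early_mod_check_exists() being a made-up helper name:

/*
 * Sketch only, not the actual prototype patch: bail out early, before
 * layout_and_allocate() reserves module VMAP space, if a module with
 * the same name is already loaded or currently being loaded. The later
 * add_unformed_module() check still handles concurrent loads that both
 * pass this early check.
 */
static int early_mod_check_exists(struct load_info *info)
{
	bool exists;

	mutex_lock(&module_mutex);
	exists = find_module_all(info->name, strlen(info->name), true) != NULL;
	mutex_unlock(&module_mutex);

	return exists ? -EEXIST : 0;
}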

Thanks!

--
Thanks,

David / dhildenb
