Hi,

we're currently hitting a weird vmap issue in debug kernels with KASAN enabled on fairly large VMs. I reproduced it on v5.19 (I did not get the chance to try 6.0 yet because I don't have access to the machine right now, but I suspect it persists).

It seems to trigger when udev probes a massive number of devices in parallel while the system is booting up. Once the system has booted, I no longer see any such issues.

[ 165.818200] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.836622] vmap allocation for size 315392 failed: use vmalloc=<size> to increase size
[ 165.837461] vmap allocation for size 315392 failed: use vmalloc=<size> to increase size
[ 165.840573] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.841059] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.841428] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.841819] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.842123] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.843359] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.844894] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.847028] CPU: 253 PID: 4995 Comm: systemd-udevd Not tainted 5.19.0 #2
[ 165.935689] Hardware name: Lenovo ThinkSystem SR950 -[7X12ABC1WW]-/-[7X12ABC1WW]-, BIOS -[PSE130O-1.81]- 05/20/2020
[ 165.947343] Call Trace:
[ 165.950075]  <TASK>
[ 165.952425]  dump_stack_lvl+0x57/0x81
[ 165.956532]  warn_alloc.cold+0x95/0x18a
[ 165.960836]  ? zone_watermark_ok_safe+0x240/0x240
[ 165.966100]  ? slab_free_freelist_hook+0x11d/0x1d0
[ 165.971461]  ? __get_vm_area_node+0x2af/0x360
[ 165.976341]  ? __get_vm_area_node+0x2af/0x360
[ 165.981219]  __vmalloc_node_range+0x291/0x560
[ 165.986087]  ? __mutex_unlock_slowpath+0x161/0x5e0
[ 165.991447]  ? move_module+0x4c/0x630
[ 165.995547]  ? vfree_atomic+0xa0/0xa0
[ 165.999647]  ? move_module+0x4c/0x630
[ 166.003741]  module_alloc+0xe7/0x170
[ 166.007747]  ? move_module+0x4c/0x630
[ 166.011840]  move_module+0x4c/0x630
[ 166.015751]  layout_and_allocate+0x32c/0x560
[ 166.020519]  load_module+0x8e0/0x25c0
[ 166.024623]  ? layout_and_allocate+0x560/0x560
[ 166.029586]  ? kernel_read_file+0x286/0x6b0
[ 166.034269]  ? __x64_sys_fspick+0x290/0x290
[ 166.038946]  ? userfaultfd_unmap_prep+0x430/0x430
[ 166.044203]  ? lock_downgrade+0x130/0x130
[ 166.048698]  ? __do_sys_finit_module+0x11a/0x1c0
[ 166.053854]  __do_sys_finit_module+0x11a/0x1c0
[ 166.058818]  ? __ia32_sys_init_module+0xa0/0xa0
[ 166.063882]  ? __seccomp_filter+0x92/0x930
[ 166.068494]  do_syscall_64+0x59/0x90
[ 166.072492]  ? do_syscall_64+0x69/0x90
[ 166.076679]  ? do_syscall_64+0x69/0x90
[ 166.080864]  ? do_syscall_64+0x69/0x90
[ 166.085047]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 166.090984]  ? lockdep_hardirqs_on+0x79/0x100
[ 166.095855]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

Some facts:

1. The #CPUs seems to be more important than the #MEM

Initially we thought the memory size would be the relevant trigger, because we had only seen it on 8TiB machines. But I was also able to reproduce it on a "small" machine with ~450GiB.

We've seen this issue only on machines with a lot of logical CPUs (~448). On such systems, I have not been able to reproduce it so far when booting the kernel with "nosmt", which could indicate some kind of concurrency problem.

2. CONFIG_KASAN_INLINE seems to be relevant

This issue only seems to trigger with KASAN enabled and, from what I can tell, only with CONFIG_KASAN_INLINE=y. The CONFIG_KASAN_INLINE help text warns that it "makes the kernel's .text size much bigger", and that should equally apply to the kernel modules being loaded.

3. All systems have 8 equally sized NUMA nodes

... which implies that at least one node is practically completely filled with KASAN data. I remember adjusting the system size with "mem=" such that some nodes were memory-less but node 0 still had some free memory; the issue still triggered.

My current best guess is that this is a combination of large VMAP demands (e.g., kernel modules that are quite large due to CONFIG_KASAN_INLINE) and possibly a concurrency issue with the large #CPUs.

But I might be wrong, and this might be something zone/node related.

Does any of that ring a bell -- especially why it would fail with 448 logical CPUs but succeed with 224 logical CPUs (nosmt)?

My best guess would be that the purge_vmap_area_lazy() logic in alloc_vmap_area() might not be sufficient when there is a lot of concurrency: simply purging once and then failing might be problematic in corner cases where there is a lot of concurrent vmap activity going on. But that's just my best guess.
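To illustrate what I mean, here is a heavily simplified paraphrase of that path in mm/vmalloc.c as of v5.19 -- written from memory, so not the literal code; locking, the per-CPU preloading and the actual vmap_area setup are all left out:

/*
 * Heavily simplified paraphrase of alloc_vmap_area() from mm/vmalloc.c
 * (v5.19): only the overflow/retry flow is shown.
 */
static struct vmap_area *alloc_vmap_area(unsigned long size,
				unsigned long align,
				unsigned long vstart, unsigned long vend,
				int node, gfp_t gfp_mask)
{
	struct vmap_area *va;	/* allocation from vmap_area_cachep omitted */
	unsigned long freed = 0;
	unsigned long addr;
	int purged = 0;

retry:
	/*
	 * Search the free space (under free_vmap_area_lock); 'vend' is
	 * returned when nothing large enough is found.
	 */
	addr = __alloc_vmap_area(size, align, vstart, vend);
	if (unlikely(addr == vend))
		goto overflow;

	va->va_start = addr;
	va->va_end = addr + size;
	/* ... insert into the busy tree and return ... */
	return va;

overflow:
	if (!purged) {
		/* First failure: drain lazily-freed areas once, retry once. */
		purge_vmap_area_lazy();
		purged = 1;
		goto retry;
	}

	/* Second failure: ask the vmap purge notifiers for memory ... */
	blocking_notifier_call_chain(&vmap_notify_list, 0, &freed);
	if (freed > 0) {
		purged = 0;
		goto retry;
	}

	/* ... otherwise warn (the message from the log above) and give up. */
	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit())
		pr_warn("vmap allocation for size %lu failed: use vmalloc=<size> to increase size\n",
			size);
	return ERR_PTR(-EBUSY);
}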
Cheers!

--
Thanks,

David / dhildenb