From: Adrian Huang <ahuang12@xxxxxxxxxx>

When compiling the kernel source with 'make -j $(nproc)' on a 256-core
machine running a KASAN-enabled kernel, the following soft lockup is
shown:

watchdog: BUG: soft lockup - CPU#28 stuck for 22s! [kworker/28:1:1760]
CPU: 28 PID: 1760 Comm: kworker/28:1 Kdump: loaded Not tainted 6.10.0-rc5 #95
Workqueue: events drain_vmap_area_work
RIP: 0010:smp_call_function_many_cond+0x1d8/0xbb0
Code: 38 c8 7c 08 84 c9 0f 85 49 08 00 00 8b 45 08 a8 01 74 2e 48 89 f1 49 89 f7 48 c1 e9 03 41 83 e7 07 4c 01 e9 41 83 c7 03 f3 90 <0f> b6 01 41 38 c7 7c 08 84 c0 0f 85 d4 06 00 00 8b 45 08 a8 01 75
RSP: 0018:ffffc9000cb3fb60 EFLAGS: 00000202
RAX: 0000000000000011 RBX: ffff8883bc4469c0 RCX: ffffed10776e9949
RDX: 0000000000000002 RSI: ffff8883bb74ca48 RDI: ffffffff8434dc50
RBP: ffff8883bb74ca40 R08: ffff888103585dc0 R09: ffff8884533a1800
R10: 0000000000000004 R11: ffffffffffffffff R12: ffffed1077888d39
R13: dffffc0000000000 R14: ffffed1077888d38 R15: 0000000000000003
FS:  0000000000000000(0000) GS:ffff8883bc400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005577b5c8d158 CR3: 0000000004850000 CR4: 0000000000350ef0
Call Trace:
 <IRQ>
 ? watchdog_timer_fn+0x2cd/0x390
 ? __pfx_watchdog_timer_fn+0x10/0x10
 ? __hrtimer_run_queues+0x300/0x6d0
 ? sched_clock_cpu+0x69/0x4e0
 ? __pfx___hrtimer_run_queues+0x10/0x10
 ? srso_return_thunk+0x5/0x5f
 ? ktime_get_update_offsets_now+0x7f/0x2a0
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 ? hrtimer_interrupt+0x2ca/0x760
 ? __sysvec_apic_timer_interrupt+0x8c/0x2b0
 ? sysvec_apic_timer_interrupt+0x6a/0x90
 </IRQ>
 <TASK>
 ? asm_sysvec_apic_timer_interrupt+0x16/0x20
 ? smp_call_function_many_cond+0x1d8/0xbb0
 ? __pfx_do_kernel_range_flush+0x10/0x10
 on_each_cpu_cond_mask+0x20/0x40
 flush_tlb_kernel_range+0x19b/0x250
 ? srso_return_thunk+0x5/0x5f
 ? kasan_release_vmalloc+0xa7/0xc0
 purge_vmap_node+0x357/0x820
 ? __pfx_purge_vmap_node+0x10/0x10
 __purge_vmap_area_lazy+0x5b8/0xa10
 drain_vmap_area_work+0x21/0x30
 process_one_work+0x661/0x10b0
 worker_thread+0x844/0x10e0
 ? srso_return_thunk+0x5/0x5f
 ? __kthread_parkme+0x82/0x140
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x2a5/0x370
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x30/0x70
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>

Debugging Analysis:

1. [Crash] The call trace indicates that CPU #28 sent an IPI to flush
   the TLB and is waiting for the other CPUs' responses. However, crash
   shows that the target CPU has already responded.

   A. CPU #28 is waiting for CPU #2's response.

      crash> cfd_data | grep -w 28
        [28]: ffff8883bc4469c0

      crash> struct call_function_data ffff8883bc4469c0
        struct call_function_data {
          csd = 0x4ca40,
          cpumask = 0xffff888103585e40,
          cpumask_ipi = 0xffff8881035858c0
        }

      crash> struct __call_single_data 0x4ca40:a | grep ffff8883bb74ca40
        [2]: ffff8883bb74ca40

   B. CPU #2 has responded because the 'CSD_FLAG_LOCK' bit of u_flags is
      cleared.

      crash> struct __call_single_data 0xffff8883bb74ca40
        struct __call_single_data {
          node = {
            ...
            {
              u_flags = 0,
              a_flags = {
                counter = 0
              }
            },
            ...
          },
          func = 0xffffffff8117b080 <do_kernel_range_flush>,
          info = 0xffff8883bc444940
        }

   C. CPU #2 is running the userspace application 'nm'.
      crash> bt -c 2
      PID: 52035   TASK: ffff888194c21ac0  CPU: 2   COMMAND: "nm"
       #0 [ffffc90043157ea8] crash_nmi_callback at ffffffff81122f42
       #1 [ffffc90043157eb0] nmi_handle at ffffffff8108c988
       #2 [ffffc90043157f10] default_do_nmi at ffffffff835b3460
       #3 [ffffc90043157f30] exc_nmi at ffffffff835b3630
       #4 [ffffc90043157f50] asm_exc_nmi at ffffffff83601573
          RIP: 00007f6261b90d38  RSP: 00007ffe4d1cb180  RFLAGS: 00000202
          RAX: 0000000000000001  RBX: 6e6f2d7865646e69  RCX: 00007f626281f634
          RDX: 00007f6262ba7f67  RSI: 00007f626265f65e  RDI: 00007f62648f8a30
          RBP: 2d6f746c6e696874   R8: 00007f62618a4cc0   R9: 0000000000000001
          R10: 00007f627233e488  R11: 00007f627233d768  R12: 000055bee0ff4b90
          R13: 000055bee0fac650  R14: 00007ffe4d1cb280  R15: 0000000000000000
          ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b

   D. The soft lockup CPU iterates over all 128 vmap_nodes (128 bits are
      set in purge_nodes) and flushes the TLB for each node. This is
      likely the time-consuming part.

      crash> p /x purge_nodes
      $1 = {
        bits = {0xffffffffffffffff, 0xffffffffffffffff, 0x0, ...}

2. [Ftrace] To prove that the soft lockup CPU spends too much time
   iterating over vmap_nodes and flushing the TLB when purging vm_area
   structures, a function-graph trace was captured. It confirms the
   speculation (some info is trimmed):

     kworker: funcgraph_entry:               |  drain_vmap_area_work() {
     kworker: funcgraph_entry:               |    mutex_lock() {
     kworker: funcgraph_entry:    1.092 us   |      __cond_resched();
     kworker: funcgraph_exit:     3.306 us   |    }
     ... ...
     kworker: funcgraph_entry:               |    flush_tlb_kernel_range() {
     ... ...
     kworker: funcgraph_exit:  # 7533.649 us |    }
     ... ...
     kworker: funcgraph_entry:    2.344 us   |    mutex_unlock();
     kworker: funcgraph_exit:  $ 23871554 us |  }

   This drain_vmap_area_work() invocation spends over 23 seconds.

   There are 2805 flush_tlb_kernel_range() calls in this ftrace log:
     * One is called from __purge_vmap_area_lazy().
     * The others are called from kasan_release_vmalloc().

3. Extending the soft lockup detection time works around the issue (for
   example, '# echo 60 > /proc/sys/kernel/watchdog_thresh'). This also
   confirms the speculation above: drain_vmap_area_work() spends too
   much time.

Commit 72210662c5a2 ("mm: vmalloc: offload free_vmap_area_lock lock")
introduced an effective vmap node logic. However, the current design
iterates over all 128 vmap_nodes and flushes the TLB on a single CPU if
vmap_lazy_nr < 2 * lazy_max_pages(). With KASAN enabled, this can
trigger the soft lockup because the additional TLB flushes issued for
the KASAN vmalloc shadow take much more time when all 128 vmap nodes
have a populated purge list.

Fix the issue by adding a preemption point in purge_vmap_node() when
KASAN is enabled (a simplified sketch of this purge path follows the
patch).

Fixes: 72210662c5a2 ("mm: vmalloc: offload free_vmap_area_lock lock")
Signed-off-by: Adrian Huang <ahuang12@xxxxxxxxxx>
Reviewed-and-tested-by: Jiwei Sun <sunjw10@xxxxxxxxxx>
---
 mm/vmalloc.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d0cbdd7c1e5b..380bdc317c8d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2193,6 +2193,15 @@ static void purge_vmap_node(struct work_struct *work)
 	struct vmap_area *va, *n_va;
 	LIST_HEAD(local_list);
 
+	/*
+	 * Add a preemption point when KASAN is enabled. Each vmap_area of
+	 * a vmap node has to flush the TLB when it is released, which can
+	 * be time-consuming work if lots of vmap nodes have a populated
+	 * purge list.
+	 */
+	if (kasan_enabled())
+		cond_resched();
+
 	vn->nr_purged = 0;
 
 	list_for_each_entry_safe(va, n_va, &vn->purge_list, list) {
-- 
2.34.1
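
For reference, here is a small userspace model (not kernel code) of the
purge path discussed above: purge_vmap_area_lazy() issues one
flush_tlb_kernel_range() for the whole lazily-freed range and, on the
single-CPU branch, walks every vmap node, where each vmap_area's
kasan_release_vmalloc() may issue another flush. The names loosely mirror
mm/vmalloc.c for readability, but every function body below is a stub,
NR_AREAS_PER_NODE is an arbitrary illustrative value, and the real
kasan_release_vmalloc() only flushes when shadow pages are actually freed.

/* purge-model.c: build with "gcc -O2 -o purge-model purge-model.c". */
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_VMAP_NODES     128  /* all 128 bits were set in purge_nodes */
#define NR_AREAS_PER_NODE 22   /* arbitrary; the trace saw ~2800 KASAN flushes */

static bool kasan_on = true;          /* stand-in for kasan_enabled() */
static unsigned long nr_tlb_flushes;  /* counts the expensive cross-CPU IPIs */

/* Stand-in for flush_tlb_kernel_range(): the IPI the lockup CPU waits on. */
static void flush_tlb_kernel_range(void)
{
	nr_tlb_flushes++;
}

/* Stand-in for kasan_release_vmalloc(): releasing shadow may flush the TLB. */
static void kasan_release_vmalloc(void)
{
	if (kasan_on)
		flush_tlb_kernel_range();
}

/* Stand-in for cond_resched(): gives the scheduler/watchdog a chance to run. */
static void cond_resched(void)
{
	sched_yield();
}

/* Models purge_vmap_node() with the patch applied. */
static void purge_vmap_node(void)
{
	/* The fix: one preemption point per node when KASAN is enabled. */
	if (kasan_on)
		cond_resched();

	for (int area = 0; area < NR_AREAS_PER_NODE; area++)
		kasan_release_vmalloc();
}

/* Models the single-CPU branch of __purge_vmap_area_lazy(). */
static void purge_vmap_area_lazy(void)
{
	flush_tlb_kernel_range();  /* one flush for the whole [start, end) range */

	for (int node = 0; node < NR_VMAP_NODES; node++)
		purge_vmap_node();  /* every node is purged by this one CPU */
}

int main(void)
{
	purge_vmap_area_lazy();
	printf("TLB flushes: %lu (1 range flush + %d KASAN shadow flushes)\n",
	       nr_tlb_flushes, NR_VMAP_NODES * NR_AREAS_PER_NODE);
	return 0;
}

Running the model only prints the flush count, but the structure shows why
a per-node cond_resched() bounds the uninterrupted work to one node's worth
of flushes instead of all 128 nodes in a row.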