From: Adrian Huang <ahuang12@xxxxxxxxxx>

When compiling the kernel source with 'make -j $(nproc)' on a 256-core
machine running a KASAN-enabled kernel, the following soft lockup is
shown:

watchdog: BUG: soft lockup - CPU#28 stuck for 22s! [kworker/28:1:1760]
CPU: 28 PID: 1760 Comm: kworker/28:1 Kdump: loaded Not tainted 6.10.0-rc5 #95
Workqueue: events drain_vmap_area_work
RIP: 0010:smp_call_function_many_cond+0x1d8/0xbb0
Code: 38 c8 7c 08 84 c9 0f 85 49 08 00 00 8b 45 08 a8 01 74 2e 48 89 f1 49 89 f7 48 c1 e9 03 41 83 e7 07 4c 01 e9 41 83 c7 03 f3 90 <0f> b6 01 41 38 c7 7c 08 84 c0 0f 85 d4 06 00 00 8b 45 08 a8 01 75
RSP: 0018:ffffc9000cb3fb60 EFLAGS: 00000202
RAX: 0000000000000011 RBX: ffff8883bc4469c0 RCX: ffffed10776e9949
RDX: 0000000000000002 RSI: ffff8883bb74ca48 RDI: ffffffff8434dc50
RBP: ffff8883bb74ca40 R08: ffff888103585dc0 R09: ffff8884533a1800
R10: 0000000000000004 R11: ffffffffffffffff R12: ffffed1077888d39
R13: dffffc0000000000 R14: ffffed1077888d38 R15: 0000000000000003
FS:  0000000000000000(0000) GS:ffff8883bc400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005577b5c8d158 CR3: 0000000004850000 CR4: 0000000000350ef0
Call Trace:
 <IRQ>
 ? watchdog_timer_fn+0x2cd/0x390
 ? __pfx_watchdog_timer_fn+0x10/0x10
 ? __hrtimer_run_queues+0x300/0x6d0
 ? sched_clock_cpu+0x69/0x4e0
 ? __pfx___hrtimer_run_queues+0x10/0x10
 ? srso_return_thunk+0x5/0x5f
 ? ktime_get_update_offsets_now+0x7f/0x2a0
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 ? hrtimer_interrupt+0x2ca/0x760
 ? __sysvec_apic_timer_interrupt+0x8c/0x2b0
 ? sysvec_apic_timer_interrupt+0x6a/0x90
 </IRQ>
 <TASK>
 ? asm_sysvec_apic_timer_interrupt+0x16/0x20
 ? smp_call_function_many_cond+0x1d8/0xbb0
 ? __pfx_do_kernel_range_flush+0x10/0x10
 on_each_cpu_cond_mask+0x20/0x40
 flush_tlb_kernel_range+0x19b/0x250
 ? srso_return_thunk+0x5/0x5f
 ? kasan_release_vmalloc+0xa7/0xc0
 purge_vmap_node+0x357/0x820
 ? __pfx_purge_vmap_node+0x10/0x10
 __purge_vmap_area_lazy+0x5b8/0xa10
 drain_vmap_area_work+0x21/0x30
 process_one_work+0x661/0x10b0
 worker_thread+0x844/0x10e0
 ? srso_return_thunk+0x5/0x5f
 ? __kthread_parkme+0x82/0x140
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x2a5/0x370
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x30/0x70
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>

Debugging Analysis:

1. [Crash] The call trace indicates that CPU #28 sent an IPI to flush
   the TLB and is waiting for the other CPUs' responses. However, crash
   shows that the target CPU has already responded.

   A. CPU #28 is waiting for CPU #2's response.

      crash> cfd_data | grep -w 28
        [28]: ffff8883bc4469c0

      crash> struct call_function_data ffff8883bc4469c0
        struct call_function_data {
          csd = 0x4ca40,
          cpumask = 0xffff888103585e40,
          cpumask_ipi = 0xffff8881035858c0
        }

      crash> struct __call_single_data 0x4ca40:a | grep ffff8883bb74ca40
        [2]: ffff8883bb74ca40

   B. CPU #2 has responded because the 'CSD_FLAG_LOCK' bit of u_flags is
      cleared.

      crash> struct __call_single_data 0xffff8883bb74ca40
        struct __call_single_data {
          node = {
            ...
            {
              u_flags = 0,
              a_flags = {
                counter = 0
              }
            },
            ...
          },
          func = 0xffffffff8117b080 <do_kernel_range_flush>,
          info = 0xffff8883bc444940
        }

   C. CPU #2 is running the userspace application 'nm'.
      crash> bt -c 2
      PID: 52035   TASK: ffff888194c21ac0  CPU: 2   COMMAND: "nm"
       #0 [ffffc90043157ea8] crash_nmi_callback at ffffffff81122f42
       #1 [ffffc90043157eb0] nmi_handle at ffffffff8108c988
       #2 [ffffc90043157f10] default_do_nmi at ffffffff835b3460
       #3 [ffffc90043157f30] exc_nmi at ffffffff835b3630
       #4 [ffffc90043157f50] asm_exc_nmi at ffffffff83601573
          RIP: 00007f6261b90d38  RSP: 00007ffe4d1cb180  RFLAGS: 00000202
          RAX: 0000000000000001  RBX: 6e6f2d7865646e69  RCX: 00007f626281f634
          RDX: 00007f6262ba7f67  RSI: 00007f626265f65e  RDI: 00007f62648f8a30
          RBP: 2d6f746c6e696874   R8: 00007f62618a4cc0   R9: 0000000000000001
          R10: 00007f627233e488  R11: 00007f627233d768  R12: 000055bee0ff4b90
          R13: 000055bee0fac650  R14: 00007ffe4d1cb280  R15: 0000000000000000
          ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b

   D. The soft lockup CPU iterates over all 128 vmap_nodes (128 bits are
      set in purge_nodes) and flushes the TLB for each node. This is
      likely the time-consuming part.

      crash> p /x purge_nodes
      $1 = {
        bits = {0xffffffffffffffff, 0xffffffffffffffff, 0x0, ...}

2. [Ftrace] To prove that the soft lockup CPU spends too much time
   iterating over vmap_nodes and flushing the TLB when purging vm_area
   structures, a function-graph trace was captured. It confirms the
   speculation (some info is trimmed):

     kworker: funcgraph_entry:               |  drain_vmap_area_work() {
     kworker: funcgraph_entry:               |    mutex_lock() {
     kworker: funcgraph_entry:    1.092 us   |      __cond_resched();
     kworker: funcgraph_exit:     3.306 us   |    }
     ... ...
     kworker: funcgraph_entry:               |    flush_tlb_kernel_range() {
     ... ...
     kworker: funcgraph_exit:  # 7533.649 us |    }
     ... ...
     kworker: funcgraph_entry:    2.344 us   |    mutex_unlock();
     kworker: funcgraph_exit:  $ 23871554 us |  }

   This drain_vmap_area_work() invocation spends over 23 seconds.

   There are 2805 flush_tlb_kernel_range() calls in this ftrace log:
     * One is called from __purge_vmap_area_lazy().
     * The others are called from kasan_release_vmalloc().

3. Extending the soft lockup detection time works around the issue (for
   example, '# echo 60 > /proc/sys/kernel/watchdog_thresh'). This also
   confirms the speculation above: drain_vmap_area_work() spends too
   much time.

Commit 72210662c5a2 ("mm: vmalloc: offload free_vmap_area_lock lock")
introduced an effective vmap node logic. However, the current design
iterates over all 128 vmap_nodes and flushes the TLB on a single CPU if
vmap_lazy_nr < 2 * lazy_max_pages(). With KASAN enabled, this can
trigger the soft lockup because the additional TLB flushes issued for
the KASAN vmalloc shadow take much more time when all 128 vmap nodes
have a populated purge list.

Fix the issue by adding a preemption point in purge_vmap_node() when
KASAN is enabled (a simplified sketch of this purge path follows the
patch).

Fixes: 72210662c5a2 ("mm: vmalloc: offload free_vmap_area_lock lock")
Signed-off-by: Adrian Huang <ahuang12@xxxxxxxxxx>
Reviewed-and-tested-by: Jiwei Sun <sunjw10@xxxxxxxxxx>
---
 mm/vmalloc.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d0cbdd7c1e5b..380bdc317c8d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2193,6 +2193,15 @@ static void purge_vmap_node(struct work_struct *work)
 	struct vmap_area *va, *n_va;
 	LIST_HEAD(local_list);
 
+	/*
+	 * Add a preemption point when KASAN is enabled. Each vmap_area of
+	 * a vmap node has to flush the TLB when it is released, which can
+	 * be time-consuming work if lots of vmap nodes have a populated
+	 * purge list.
+	 */
+	if (kasan_enabled())
+		cond_resched();
+
 	vn->nr_purged = 0;
 
 	list_for_each_entry_safe(va, n_va, &vn->purge_list, list) {
-- 
2.34.1
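
For reference, here is a small userspace model (not kernel code) of the
purge path discussed above: purge_vmap_area_lazy() issues one
flush_tlb_kernel_range() for the whole lazily-freed range and, on the
single-CPU branch, walks every vmap node, where each vmap_area's
kasan_release_vmalloc() may issue another flush. The names loosely mirror
mm/vmalloc.c for readability, but every function body below is a stub,
NR_AREAS_PER_NODE is an arbitrary illustrative value, and the real
kasan_release_vmalloc() only flushes when shadow pages are actually freed.

/* purge-model.c: build with "gcc -O2 -o purge-model purge-model.c". */
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_VMAP_NODES     128  /* all 128 bits were set in purge_nodes */
#define NR_AREAS_PER_NODE 22   /* arbitrary; the trace saw ~2800 KASAN flushes */

static bool kasan_on = true;          /* stand-in for kasan_enabled() */
static unsigned long nr_tlb_flushes;  /* counts the expensive cross-CPU IPIs */

/* Stand-in for flush_tlb_kernel_range(): the IPI the lockup CPU waits on. */
static void flush_tlb_kernel_range(void)
{
	nr_tlb_flushes++;
}

/* Stand-in for kasan_release_vmalloc(): releasing shadow may flush the TLB. */
static void kasan_release_vmalloc(void)
{
	if (kasan_on)
		flush_tlb_kernel_range();
}

/* Stand-in for cond_resched(): gives the scheduler/watchdog a chance to run. */
static void cond_resched(void)
{
	sched_yield();
}

/* Models purge_vmap_node() with the patch applied. */
static void purge_vmap_node(void)
{
	/* The fix: one preemption point per node when KASAN is enabled. */
	if (kasan_on)
		cond_resched();

	for (int area = 0; area < NR_AREAS_PER_NODE; area++)
		kasan_release_vmalloc();
}

/* Models the single-CPU branch of __purge_vmap_area_lazy(). */
static void purge_vmap_area_lazy(void)
{
	flush_tlb_kernel_range();  /* one flush for the whole [start, end) range */

	for (int node = 0; node < NR_VMAP_NODES; node++)
		purge_vmap_node();  /* every node is purged by this one CPU */
}

int main(void)
{
	purge_vmap_area_lazy();
	printf("TLB flushes: %lu (1 range flush + %d KASAN shadow flushes)\n",
	       nr_tlb_flushes, NR_VMAP_NODES * NR_AREAS_PER_NODE);
	return 0;
}

Running the model only prints the flush count, but the structure shows why
a per-node cond_resched() bounds the uninterrupted work to one node's worth
of flushes instead of all 128 nodes in a row.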