From: Adrian Huang <ahuang12@xxxxxxxxxx>

When running the test_vmalloc stress test on a 448-core server, the
following soft/hard lockups were observed, and the OS eventually
panicked.

1) Kernel config

   CONFIG_KASAN=y
   CONFIG_KASAN_VMALLOC=y

2) Reproduced command

   # modprobe test_vmalloc nr_threads=448 run_test_mask=0x1 nr_pages=8

3) OS log (see [1] for details)

   watchdog: BUG: soft lockup - CPU#258 stuck for 26s!
   RIP: 0010:native_queued_spin_lock_slowpath+0x504/0x940
   Call Trace:
    do_raw_spin_lock+0x1e7/0x270
    _raw_spin_lock+0x63/0x80
    kasan_depopulate_vmalloc_pte+0x3c/0x70
    apply_to_pte_range+0x127/0x4e0
    apply_to_pmd_range+0x19e/0x5c0
    apply_to_pud_range+0x167/0x510
    __apply_to_page_range+0x2b4/0x7c0
    kasan_release_vmalloc+0xc8/0xd0
    purge_vmap_node+0x190/0x980
    __purge_vmap_area_lazy+0x640/0xa60
    drain_vmap_area_work+0x23/0x30
    process_one_work+0x84a/0x1760
    worker_thread+0x54d/0xc60
    kthread+0x2a8/0x380
    ret_from_fork+0x2d/0x70
    ret_from_fork_asm+0x1a/0x30
   ...
   watchdog: Watchdog detected hard LOCKUP on cpu 8
   watchdog: Watchdog detected hard LOCKUP on cpu 42
   watchdog: Watchdog detected hard LOCKUP on cpu 10
   ...
   Shutting down cpus with NMI
   Kernel Offset: disabled
   pstore: backend (erst) writing error (-28)
   ---[ end Kernel panic - not syncing: Hard LOCKUP ]---

BTW, the issue can also be reproduced on a 192-core server and a
256-core server.

[Root Cause]
The tight loop in kasan_release_vmalloc_node() iteratively calls
kasan_release_vmalloc() to clear the corresponding PTEs, and each call
acquires/releases "init_mm.page_table_lock" in
kasan_depopulate_vmalloc_pte(). lock_stat shows that
"init_mm.page_table_lock" tops the contention list. The lock_stat info
below is based on the following command (with a reduced thread count,
in order not to panic the OS); the max wait time is 600ms:

  # modprobe test_vmalloc nr_threads=150 run_test_mask=0x1 nr_pages=8

<snip>
------------------------------------------------------------------
class name              con-bounces  contentions  waittime-min  waittime-max ...
------------------------------------------------------------------
init_mm.page_table_lock:   87859653     93020601          0.27     600304.90 ...
-----------------------
init_mm.page_table_lock    54332301  [<000000008ce229be>] kasan_populate_vmalloc_pte.part.0.isra.0+0x99/0x120
init_mm.page_table_lock     6680902  [<000000009c0800ad>] __pte_alloc_kernel+0x9b/0x370
init_mm.page_table_lock    31991077  [<00000000180bc35d>] kasan_depopulate_vmalloc_pte+0x3c/0x70
init_mm.page_table_lock       16321  [<000000003ef0e79b>] __pmd_alloc+0x1d5/0x720
-----------------------
init_mm.page_table_lock    50278552  [<000000008ce229be>] kasan_populate_vmalloc_pte.part.0.isra.0+0x99/0x120
init_mm.page_table_lock     5725380  [<000000009c0800ad>] __pte_alloc_kernel+0x9b/0x370
init_mm.page_table_lock    36992410  [<00000000180bc35d>] kasan_depopulate_vmalloc_pte+0x3c/0x70
init_mm.page_table_lock       24259  [<000000003ef0e79b>] __pmd_alloc+0x1d5/0x720
...
<snip>

[Solution]
After revisiting the code paths that set a kasan ptep (pte pointer),
it is unlikely that a kasan ptep is set and cleared simultaneously by
different CPUs. So, use ptep_get_and_clear() to get rid of the
spinlock operation. With this change, the max wait time drops to 13ms
under the following command (all 448 cores fully stressed):

  # modprobe test_vmalloc nr_threads=448 run_test_mask=0x1 nr_pages=8

<snip>
------------------------------------------------------------------
class name              con-bounces  contentions  waittime-min  waittime-max ...
------------------------------------------------------------------
init_mm.page_table_lock:  109999304    110008477          0.27      13534.76
-----------------------
init_mm.page_table_lock   109369156  [<000000001a135943>] kasan_populate_vmalloc_pte.part.0.isra.0+0x99/0x120
init_mm.page_table_lock      637661  [<0000000051481d84>] __pte_alloc_kernel+0x9b/0x370
init_mm.page_table_lock        1660  [<00000000a492cdc5>] __pmd_alloc+0x1d5/0x720
-----------------------
init_mm.page_table_lock   109410237  [<000000001a135943>] kasan_populate_vmalloc_pte.part.0.isra.0+0x99/0x120
init_mm.page_table_lock      595016  [<0000000051481d84>] __pte_alloc_kernel+0x9b/0x370
init_mm.page_table_lock        3224  [<00000000a492cdc5>] __pmd_alloc+0x1d5/0x720
<snip>

[More verifications on a 448-core server: Passed]

1) test_vmalloc module
   * Each test was run sequentially.

2) stress-ng
   * fork() and exit()
     # stress-ng --fork 448 --timeout 180
   * pthread
     # stress-ng --pthread 448 --timeout 180
   * fork()/exit() and pthread
     # stress-ng --pthread 448 --fork 448 --timeout 180

The above verifications were run repeatedly for more than 24 hours.

[1] https://gist.github.com/AdrianHuang/99d12986a465cc33a38c7a7ceeb6f507

Signed-off-by: Adrian Huang <ahuang12@xxxxxxxxxx>
---
 mm/kasan/shadow.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index 88d1c9dcb507..985356811aee 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -397,17 +397,13 @@ int kasan_populate_vmalloc(unsigned long addr, unsigned long size)
 static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
 					void *unused)
 {
+	pte_t orig_pte = ptep_get_and_clear(&init_mm, addr, ptep);
 	unsigned long page;
 
-	page = (unsigned long)__va(pte_pfn(ptep_get(ptep)) << PAGE_SHIFT);
-
-	spin_lock(&init_mm.page_table_lock);
-
-	if (likely(!pte_none(ptep_get(ptep)))) {
-		pte_clear(&init_mm, addr, ptep);
+	if (likely(!pte_none(orig_pte))) {
+		page = (unsigned long)__va(pte_pfn(orig_pte) << PAGE_SHIFT);
 		free_page(page);
 	}
-	spin_unlock(&init_mm.page_table_lock);
 
 	return 0;
 }
-- 
2.34.1
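P.S. For readers who want the net effect without mentally applying the
hunk, below is kasan_depopulate_vmalloc_pte() as it reads after the
patch, reconstructed from the diff above and annotated. The xchg remark
in the comment describes the x86 implementation of ptep_get_and_clear()
and is an assumption here, not something stated in the patch (the
generic fallback in include/linux/pgtable.h is a plain ptep_get()
followed by pte_clear()):

static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
					void *unused)
{
	/*
	 * Read and clear the shadow PTE in one step. On x86 SMP this is
	 * an atomic xchg, so even if two CPUs raced on the same ptep
	 * (which the analysis above argues does not happen), only one
	 * would observe a non-none orig_pte and free the backing page;
	 * the other would see pte_none() and skip the free. The per-PTE
	 * spin_lock(&init_mm.page_table_lock) is therefore unnecessary.
	 */
	pte_t orig_pte = ptep_get_and_clear(&init_mm, addr, ptep);
	unsigned long page;

	if (likely(!pte_none(orig_pte))) {
		/*
		 * Compute the shadow page's virtual address from the
		 * snapshotted PTE, only after a valid PTE was observed.
		 */
		page = (unsigned long)__va(pte_pfn(orig_pte) << PAGE_SHIFT);
		free_page(page);
	}

	return 0;
}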