From: Adrian Huang <ahuang12@xxxxxxxxxx>

When running the test_vmalloc stress test on a 448-core server, the
following soft/hard lockups were observed, and the OS eventually
panicked.

1) Kernel config

   CONFIG_KASAN=y
   CONFIG_KASAN_VMALLOC=y

2) Reproduced command

   # modprobe test_vmalloc nr_threads=448 run_test_mask=0x1 nr_pages=8

3) OS log (see [1] for details)

   watchdog: BUG: soft lockup - CPU#258 stuck for 26s!
   RIP: 0010:native_queued_spin_lock_slowpath+0x504/0x940
   Call Trace:
    do_raw_spin_lock+0x1e7/0x270
    _raw_spin_lock+0x63/0x80
    kasan_depopulate_vmalloc_pte+0x3c/0x70
    apply_to_pte_range+0x127/0x4e0
    apply_to_pmd_range+0x19e/0x5c0
    apply_to_pud_range+0x167/0x510
    __apply_to_page_range+0x2b4/0x7c0
    kasan_release_vmalloc+0xc8/0xd0
    purge_vmap_node+0x190/0x980
    __purge_vmap_area_lazy+0x640/0xa60
    drain_vmap_area_work+0x23/0x30
    process_one_work+0x84a/0x1760
    worker_thread+0x54d/0xc60
    kthread+0x2a8/0x380
    ret_from_fork+0x2d/0x70
    ret_from_fork_asm+0x1a/0x30
   ...
   watchdog: Watchdog detected hard LOCKUP on cpu 8
   watchdog: Watchdog detected hard LOCKUP on cpu 42
   watchdog: Watchdog detected hard LOCKUP on cpu 10
   ...
   Shutting down cpus with NMI
   Kernel Offset: disabled
   pstore: backend (erst) writing error (-28)
   ---[ end Kernel panic - not syncing: Hard LOCKUP ]---

BTW, the issue can also be reproduced on a 192-core server and a
256-core server.

[Root Cause]
The tight loop in kasan_release_vmalloc_node() iteratively calls
kasan_release_vmalloc() to clear the corresponding PTEs, and each call
acquires/releases "init_mm.page_table_lock" in
kasan_depopulate_vmalloc_pte(). lock_stat shows that
"init_mm.page_table_lock" tops the contention list. The lock_stat info
below is based on the following command (with a reduced thread count,
in order not to panic the OS); the max wait time is 600ms:

  # modprobe test_vmalloc nr_threads=150 run_test_mask=0x1 nr_pages=8

<snip>
------------------------------------------------------------------
class name              con-bounces  contentions  waittime-min  waittime-max ...
------------------------------------------------------------------
init_mm.page_table_lock:   87859653     93020601          0.27     600304.90 ...
-----------------------
init_mm.page_table_lock    54332301  [<000000008ce229be>] kasan_populate_vmalloc_pte.part.0.isra.0+0x99/0x120
init_mm.page_table_lock     6680902  [<000000009c0800ad>] __pte_alloc_kernel+0x9b/0x370
init_mm.page_table_lock    31991077  [<00000000180bc35d>] kasan_depopulate_vmalloc_pte+0x3c/0x70
init_mm.page_table_lock       16321  [<000000003ef0e79b>] __pmd_alloc+0x1d5/0x720
-----------------------
init_mm.page_table_lock    50278552  [<000000008ce229be>] kasan_populate_vmalloc_pte.part.0.isra.0+0x99/0x120
init_mm.page_table_lock     5725380  [<000000009c0800ad>] __pte_alloc_kernel+0x9b/0x370
init_mm.page_table_lock    36992410  [<00000000180bc35d>] kasan_depopulate_vmalloc_pte+0x3c/0x70
init_mm.page_table_lock       24259  [<000000003ef0e79b>] __pmd_alloc+0x1d5/0x720
...
<snip>

[Solution]
After revisiting the code paths that set a kasan ptep (pte pointer),
it is unlikely that a kasan ptep is set and cleared simultaneously by
different CPUs. So, use ptep_get_and_clear() to get rid of the
spinlock operation. With this change, the max wait time drops to 13ms
under the following command (all 448 cores fully stressed):

  # modprobe test_vmalloc nr_threads=448 run_test_mask=0x1 nr_pages=8

<snip>
------------------------------------------------------------------
class name              con-bounces  contentions  waittime-min  waittime-max ...
------------------------------------------------------------------
init_mm.page_table_lock:  109999304    110008477          0.27      13534.76
-----------------------
init_mm.page_table_lock   109369156  [<000000001a135943>] kasan_populate_vmalloc_pte.part.0.isra.0+0x99/0x120
init_mm.page_table_lock      637661  [<0000000051481d84>] __pte_alloc_kernel+0x9b/0x370
init_mm.page_table_lock        1660  [<00000000a492cdc5>] __pmd_alloc+0x1d5/0x720
-----------------------
init_mm.page_table_lock   109410237  [<000000001a135943>] kasan_populate_vmalloc_pte.part.0.isra.0+0x99/0x120
init_mm.page_table_lock      595016  [<0000000051481d84>] __pte_alloc_kernel+0x9b/0x370
init_mm.page_table_lock        3224  [<00000000a492cdc5>] __pmd_alloc+0x1d5/0x720
<snip>

[More verifications on a 448-core server: Passed]

1) test_vmalloc module
   * Each test was run sequentially.

2) stress-ng
   * fork() and exit()
     # stress-ng --fork 448 --timeout 180
   * pthread
     # stress-ng --pthread 448 --timeout 180
   * fork()/exit() and pthread
     # stress-ng --pthread 448 --fork 448 --timeout 180

The above verifications were run repeatedly for more than 24 hours.

[1] https://gist.github.com/AdrianHuang/99d12986a465cc33a38c7a7ceeb6f507

Signed-off-by: Adrian Huang <ahuang12@xxxxxxxxxx>
---
 mm/kasan/shadow.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index 88d1c9dcb507..985356811aee 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -397,17 +397,13 @@ int kasan_populate_vmalloc(unsigned long addr, unsigned long size)
 static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
 					void *unused)
 {
+	pte_t orig_pte = ptep_get_and_clear(&init_mm, addr, ptep);
 	unsigned long page;
 
-	page = (unsigned long)__va(pte_pfn(ptep_get(ptep)) << PAGE_SHIFT);
-
-	spin_lock(&init_mm.page_table_lock);
-
-	if (likely(!pte_none(ptep_get(ptep)))) {
-		pte_clear(&init_mm, addr, ptep);
+	if (likely(!pte_none(orig_pte))) {
+		page = (unsigned long)__va(pte_pfn(orig_pte) << PAGE_SHIFT);
 		free_page(page);
 	}
-	spin_unlock(&init_mm.page_table_lock);
 
 	return 0;
 }
-- 
2.34.1
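P.S. For readers who want the net effect without mentally applying the
hunk, below is kasan_depopulate_vmalloc_pte() as it reads after the
patch, reconstructed from the diff above and annotated. The xchg remark
in the comment describes the x86 implementation of ptep_get_and_clear()
and is an assumption here, not something stated in the patch (the
generic fallback in include/linux/pgtable.h is a plain ptep_get()
followed by pte_clear()):

static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
					void *unused)
{
	/*
	 * Read and clear the shadow PTE in one step. On x86 SMP this is
	 * an atomic xchg, so even if two CPUs raced on the same ptep
	 * (which the analysis above argues does not happen), only one
	 * would observe a non-none orig_pte and free the backing page;
	 * the other would see pte_none() and skip the free. The per-PTE
	 * spin_lock(&init_mm.page_table_lock) is therefore unnecessary.
	 */
	pte_t orig_pte = ptep_get_and_clear(&init_mm, addr, ptep);
	unsigned long page;

	if (likely(!pte_none(orig_pte))) {
		/*
		 * Compute the shadow page's virtual address from the
		 * snapshotted PTE, only after a valid PTE was observed.
		 */
		page = (unsigned long)__va(pte_pfn(orig_pte) << PAGE_SHIFT);
		free_page(page);
	}

	return 0;
}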