Re: [PATCH -rfc 0/3] mm: kasan: fix softlock when populate or depopulate pte

Uladzislau Rezki <urezki@xxxxxxxxx> · Thu, 19 Oct 2023 10:53:48 +0200

On Thu, Oct 19, 2023 at 03:26:48PM +0800, Kefeng Wang wrote:
> 
> 
> On 2023/10/19 14:17, Uladzislau Rezki wrote:
> > On Thu, Oct 19, 2023 at 09:40:10AM +0800, Kefeng Wang wrote:
> > > 
> > > 
> > > On 2023/10/19 0:37, Marco Elver wrote:
> > > > On Wed, 18 Oct 2023 at 16:16, 'Kefeng Wang' via kasan-dev
> > > > <kasan-dev@xxxxxxxxxxxxxxxx> wrote:
> > > > > 
> > > > > The issue is easy to reproduced with large vmalloc, kindly ping...
> > > > > 
> > > > > On 2023/9/15 8:58, Kefeng Wang wrote:
> > > > > > Hi All， any suggest or comments，many thanks.
> > > > > > 
> > > > > > On 2023/9/6 20:42, Kefeng Wang wrote:
> > > > > > > This is a RFC, even patch3 is a hack to fix the softlock issue when
> > > > > > > populate or depopulate pte with large region, looking forward to your
> > > > > > > reply and advise, thanks.
> > > > > > 
> > > > > > Here is full stack，for populate pte，
> > > > > > 
> > > > > > [    C3] watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [insmod:458]
> > > > > > [    C3] Modules linked in: test(OE+)
> > > > > > [    C3] irq event stamp: 320776
> > > > > > [    C3] hardirqs last  enabled at (320775): [<ffff8000815a0c98>]
> > > > > > _raw_spin_unlock_irqrestore+0x98/0xb8
> > > > > > [    C3] hardirqs last disabled at (320776): [<ffff8000815816e0>]
> > > > > > el1_interrupt+0x38/0xa8
> > > > > > [    C3] softirqs last  enabled at (318174): [<ffff800080040ba8>]
> > > > > > __do_softirq+0x658/0x7ac
> > > > > > [    C3] softirqs last disabled at (318169): [<ffff800080047fd8>]
> > > > > > ____do_softirq+0x18/0x30
> > > > > > [    C3] CPU: 3 PID: 458 Comm: insmod Tainted: G           OE 6.5.0+ #595
> > > > > > [    C3] Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
> > > > > > [    C3] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > > > [    C3] pc : _raw_spin_unlock_irqrestore+0x50/0xb8
> > > > > > [    C3] lr : _raw_spin_unlock_irqrestore+0x98/0xb8
> > > > > > [    C3] sp : ffff800093386d70
> > > > > > [    C3] x29: ffff800093386d70 x28: 0000000000000801 x27: ffff0007ffffa9c0
> > > > > > [    C3] x26: 0000000000000000 x25: 000000000000003f x24: fffffc0004353708
> > > > > > [    C3] x23: ffff0006d476bad8 x22: fffffc0004353748 x21: 0000000000000000
> > > > > > [    C3] x20: ffff0007ffffafc0 x19: 0000000000000000 x18: 0000000000000000
> > > > > > [    C3] x17: ffff80008024e7fc x16: ffff80008055a8f0 x15: ffff80008024ec60
> > > > > > [    C3] x14: ffff80008024ead0 x13: ffff80008024e7fc x12: ffff6000fffff5f9
> > > > > > [    C3] x11: 1fffe000fffff5f8 x10: ffff6000fffff5f8 x9 : 1fffe000fffff5f8
> > > > > > [    C3] x8 : dfff800000000000 x7 : 00000000f2000000 x6 : dfff800000000000
> > > > > > [    C3] x5 : 00000000f2f2f200 x4 : dfff800000000000 x3 : ffff700012670d70
> > > > > > [    C3] x2 : 0000000000000001 x1 : c9a5dbfae610fa24 x0 : 000000000004e507
> > > > > > [    C3] Call trace:
> > > > > > [    C3]  _raw_spin_unlock_irqrestore+0x50/0xb8
> > > > > > [    C3]  rmqueue_bulk+0x434/0x6b8
> > > > > > [    C3]  get_page_from_freelist+0xdd4/0x1680
> > > > > > [    C3]  __alloc_pages+0x244/0x508
> > > > > > [    C3]  alloc_pages+0xf0/0x218
> > > > > > [    C3]  __get_free_pages+0x1c/0x50
> > > > > > [    C3]  kasan_populate_vmalloc_pte+0x30/0x188
> > > > > > [    C3]  __apply_to_page_range+0x3ec/0x650
> > > > > > [    C3]  apply_to_page_range+0x1c/0x30
> > > > > > [    C3]  kasan_populate_vmalloc+0x60/0x70
> > > > > > [    C3]  alloc_vmap_area.part.67+0x328/0xe50
> > > > > > [    C3]  alloc_vmap_area+0x4c/0x78
> > > > > > [    C3]  __get_vm_area_node.constprop.76+0x130/0x240
> > > > > > [    C3]  __vmalloc_node_range+0x12c/0x340
> > > > > > [    C3]  __vmalloc_node+0x8c/0xb0
> > > > > > [    C3]  vmalloc+0x2c/0x40
> > > > > > [    C3]  show_mem_init+0x1c/0xff8 [test]
> > > > > > [    C3]  do_one_initcall+0xe4/0x500
> > > > > > [    C3]  do_init_module+0x100/0x358
> > > > > > [    C3]  load_module+0x2e64/0x2fc8
> > > > > > [    C3]  init_module_from_file+0xec/0x148
> > > > > > [    C3]  idempotent_init_module+0x278/0x380
> > > > > > [    C3]  __arm64_sys_finit_module+0x88/0xf8
> > > > > > [    C3]  invoke_syscall+0x64/0x188
> > > > > > [    C3]  el0_svc_common.constprop.1+0xec/0x198
> > > > > > [    C3]  do_el0_svc+0x48/0xc8
> > > > > > [    C3]  el0_svc+0x3c/0xe8
> > > > > > [    C3]  el0t_64_sync_handler+0xa0/0xc8
> > > > > > [    C3]  el0t_64_sync+0x188/0x190
> > > > > > 
This trace is stuck in the rmqueue_bulk() because you request a
huge alloc size. It has nothing to do with free_vmap_area_lock,
it is about bulk allocator. It gets stuck to accomplish such
demand.


> > > > > > and for depopuldate pte，
> > > > > > 
> > > > > > [    C6] watchdog: BUG: soft lockup - CPU#6 stuck for 48s! [kworker/6:1:59]
> > > > > > [    C6] Modules linked in: test(OE+)
> > > > > > [    C6] irq event stamp: 39458
> > > > > > [    C6] hardirqs last  enabled at (39457): [<ffff8000815a0c98>]
> > > > > > _raw_spin_unlock_irqrestore+0x98/0xb8
> > > > > > [    C6] hardirqs last disabled at (39458): [<ffff8000815816e0>]
> > > > > > el1_interrupt+0x38/0xa8
> > > > > > [    C6] softirqs last  enabled at (39420): [<ffff800080040ba8>]
> > > > > > __do_softirq+0x658/0x7ac
> > > > > > [    C6] softirqs last disabled at (39415): [<ffff800080047fd8>]
> > > > > > ____do_softirq+0x18/0x30
> > > > > > [    C6] CPU: 6 PID: 59 Comm: kworker/6:1 Tainted: G           OEL
> > > > > > 6.5.0+ #595
> > > > > > [    C6] Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
> > > > > > [    C6] Workqueue: events drain_vmap_area_work
> > > > > > [    C6] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > > > [    C6] pc : _raw_spin_unlock_irqrestore+0x50/0xb8
> > > > > > [    C6] lr : _raw_spin_unlock_irqrestore+0x98/0xb8
> > > > > > [    C6] sp : ffff80008fe676b0
> > > > > > [    C6] x29: ffff80008fe676b0 x28: fffffc000601d310 x27: ffff000edf5dfa80
> > > > > > [    C6] x26: ffff000edf5dfad8 x25: 0000000000000000 x24: 0000000000000006
> > > > > > [    C6] x23: ffff000edf5dfad4 x22: 0000000000000000 x21: 0000000000000006
> > > > > > [    C6] x20: ffff0007ffffafc0 x19: 0000000000000000 x18: 0000000000000000
> > > > > > [    C6] x17: ffff8000805544b8 x16: ffff800080553d94 x15: ffff8000805c11b0
> > > > > > [    C6] x14: ffff8000805baeb0 x13: ffff800080047e10 x12: ffff6000fffff5f9
> > > > > > [    C6] x11: 1fffe000fffff5f8 x10: ffff6000fffff5f8 x9 : 1fffe000fffff5f8
> > > > > > [    C6] x8 : dfff800000000000 x7 : 00000000f2000000 x6 : dfff800000000000
> > > > > > [    C6] x5 : 00000000f2f2f200 x4 : dfff800000000000 x3 : ffff700011fcce98
> > > > > > [    C6] x2 : 0000000000000001 x1 : cf09d5450e2b4f7f x0 : 0000000000009a21
> > > > > > [    C6] Call trace:
> > > > > > [    C6]  _raw_spin_unlock_irqrestore+0x50/0xb8
> > > > > > [    C6]  free_pcppages_bulk+0x2bc/0x3e0
> > > > > > [    C6]  free_unref_page_commit+0x1fc/0x290
> > > > > > [    C6]  free_unref_page+0x184/0x250
> > > > > > [    C6]  __free_pages+0x154/0x1a0
> > > > > > [    C6]  free_pages+0x88/0xb0
> > > > > > [    C6]  kasan_depopulate_vmalloc_pte+0x58/0x80
> > > > > > [    C6]  __apply_to_page_range+0x3ec/0x650
> > > > > > [    C6]  apply_to_existing_page_range+0x1c/0x30
> > > > > > [    C6]  kasan_release_vmalloc+0xa4/0x118
> > > > > > [    C6]  __purge_vmap_area_lazy+0x4f4/0xe30
> > > > > > [    C6]  drain_vmap_area_work+0x60/0xc0
> > > > > > [    C6]  process_one_work+0x4cc/0xa38
> > > > > > [    C6]  worker_thread+0x240/0x638
> > > > > > [    C6]  kthread+0x1c8/0x1e0
> > > > > > [    C6]  ret_from_fork+0x10/0x20
> > > > > > 
> 
> See Call Trace of softlock, when map/unmap the vmalloc buf, the kasan will
> populate and depopulate vmalloc pte, those will spend more time than
> no-kasan kernel, for unmap, and there is already a cond_resched_lock() in
> __purge_vmap_area_lazy(), but with more time consumed under
> spinlock(free_vmap_area_lock), and we couldn't add cond_resched_lock in
> kasan_depopulate_vmalloc_pte(), so if spin lock converted to mutex lock, we
> could add a cond_resched into kasan depopulate, this is why make such
> conversion if kasan enabled, but this
> conversion maybe not correct, any better solution?
> 
I have at least below thoughts:

a) Add a max allowed threshold that user can request over vmalloc() call.
  I do not think ~40G is a correct request.

b) This can fix unmap path: 

<snip>

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ef8599d394fd..988735da5c5c 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1723,7 +1723,6 @@ static void purge_fragmented_blocks_allcpus(void);
  */
 static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 {
-	unsigned long resched_threshold;
 	unsigned int num_purged_areas = 0;
 	struct list_head local_purge_list;
 	struct vmap_area *va, *n_va;
@@ -1747,36 +1746,32 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 			struct vmap_area, list)->va_end);
 
 	flush_tlb_kernel_range(start, end);
-	resched_threshold = lazy_max_pages() << 1;
 
-	spin_lock(&free_vmap_area_lock);
 	list_for_each_entry_safe(va, n_va, &local_purge_list, list) {
 		unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT;
 		unsigned long orig_start = va->va_start;
 		unsigned long orig_end = va->va_end;
 
+		if (is_vmalloc_or_module_addr((void *)orig_start))
+			kasan_release_vmalloc(orig_start, orig_end,
+					      va->va_start, va->va_end);
+
 		/*
 		 * Finally insert or merge lazily-freed area. It is
 		 * detached and there is no need to "unlink" it from
 		 * anything.
 		 */
+		spin_lock(&free_vmap_area_lock);
 		va = merge_or_add_vmap_area_augment(va, &free_vmap_area_root,
 				&free_vmap_area_list);
+		spin_unlock(&free_vmap_area_lock);
 
 		if (!va)
 			continue;
 
-		if (is_vmalloc_or_module_addr((void *)orig_start))
-			kasan_release_vmalloc(orig_start, orig_end,
-					      va->va_start, va->va_end);
-
 		atomic_long_sub(nr, &vmap_lazy_nr);
 		num_purged_areas++;
-
-		if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
-			cond_resched_lock(&free_vmap_area_lock);
 	}
-	spin_unlock(&free_vmap_area_lock);
 
 out:
 	trace_purge_vmap_area_lazy(start, end, num_purged_areas);
<snip>

c) bulk-path i have not checked, but on a high level kasan_populate_vmalloc()
should take a breath between requests.

--
Uladzislau Rezki