Re: [Bug 216489] New: Machine freezes due to memory lock

Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> · Thu, 15 Sep 2022 13:39:31 -0700

(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Wed, 14 Sep 2022 15:07:46 +0000 bugzilla-daemon@xxxxxxxxxx wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=216489
> 
>             Bug ID: 216489
>            Summary: Machine freezes due to memory lock
>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: 5.19.8
>           Hardware: AMD
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: Other
>           Assignee: akpm@xxxxxxxxxxxxxxxxxxxx
>           Reporter: dev@xxxxxxxxxxx
>         Regression: No
> 
> Hi all,
> With Kernel 5.19.x we noticed system freezes. This happens in virtual
> environments as well as on real hardware.
> On a real hardware machine we were able to catch the moment of freeze with
> continuous profiling.

Thanks.  I forwarded this to Uladzislau and he offered to help.  He said:

: I can help with debugging. What I need is the steps to reproduce. Could you
: please clarify whether it is easy to hit and what kind of profiling triggers it?

and

: I do not think that Matthew Wilcox's commit breaks it, but... I see that
: __vunmap() is invoked by free_work(), thus the caller is in atomic
: context, including IRQ context.
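
As a rough illustration of the deferral described above: a free requested
from atomic context is queued, and a worker performs the real unmap later.
This is a single-threaded userspace sketch, not kernel code. Every name in
it (model_vfree, model_free_work, model_unmap, model_in_atomic) is
hypothetical and only stands in for the behaviour being discussed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace model only; none of these names are kernel APIs. */

struct deferred_node {
    struct deferred_node *next;
};

static struct deferred_node *deferred_list; /* frees queued from "atomic" context */
static bool model_in_atomic;                /* stand-in for an "am I in atomic context?" check */
static int model_freed_count;               /* number of buffers actually released */

/* Stand-in for the real unmap path, which may take locks. */
static void model_unmap(void *addr)
{
    free(addr);
    model_freed_count++;
}

/* Stand-in for the worker that drains the queue later, in process context. */
static void model_free_work(void)
{
    struct deferred_node *node = deferred_list;

    deferred_list = NULL;
    while (node) {
        struct deferred_node *next = node->next; /* read before the memory is released */

        model_unmap(node);
        node = next;
    }
}

/* Release immediately in process context; otherwise queue for the worker. */
static void model_vfree(void *addr)
{
    if (!addr)
        return;

    if (model_in_atomic) {
        /*
         * Modelling choice: reuse the buffer itself as the list node, so
         * nothing has to be allocated on this path. The buffer must be at
         * least sizeof(struct deferred_node) bytes.
         */
        struct deferred_node *node = addr;

        node->next = deferred_list;
        deferred_list = node;
    } else {
        model_unmap(addr);
    }
}
```

In this model, calling model_vfree() with model_in_atomic set only queues the
buffer. Nothing is released until model_free_work() runs, which is why the
worker, not the original caller, shows up in a trace of the unmap path.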

> Specification of the machine where we captured the freeze:
> Thinkpad T14
> CPU: AMD Ryzen 7 PRO 4750U
> Kernel: 5.19.8-200.fc36.x86_64
> 
> Stacktrace of kworker/12:3 that is using all resources and causing the freeze:
> 
> #   Source Location                 Function Name               Function Line
> 0   arch/x86/include/asm/vdso/processor.h:13    rep_nop                 11
> 1   arch/x86/include/asm/vdso/processor.h:18    cpu_relax               16
> 2   kernel/locking/qspinlock.c:514          native_queued_spin_lock_slowpath    316
> 3   kernel/locking/qspinlock.c:316          native_queued_spin_lock_slowpath    N/A
> 4   arch/x86/include/asm/paravirt.h:591     pv_queued_spin_lock_slowpath        588
> 5   arch/x86/include/asm/qspinlock.h:51     queued_spin_lock_slowpath       49
> 6   include/asm-generic/qspinlock.h:114     queued_spin_lock            107
> 7   include/linux/spinlock.h:185            do_raw_spin_lock            182
> 8   include/linux/spinlock_api_smp.h:134        __raw_spin_lock             130
> 9   kernel/locking/spinlock.c:154           _raw_spin_lock              152
> 10  include/linux/spinlock.h:349            spin_lock               347
> 11  mm/vmalloc.c:1805               find_vmap_area              1801
> 12  mm/vmalloc.c:2525               find_vm_area                2521
> 13  mm/vmalloc.c:2639               __vunmap                2628
> 14  mm/vmalloc.c:97                 free_work               91
> 15  kernel/workqueue.c:2289             process_one_work            2181
> 16  kernel/workqueue.c:2436             worker_thread               2378
> 17  kernel/kthread.c:376                kthread                 330
> 18  N/A                     ret_from_fork               N/A
> 
> The functions in the stack trace shown above rarely change. For 5.19, only one
> commit, 993d0b287e2ef7bee2e8b13b0ce4d2b5066f278e, introduces changes to
> find_vmap_area().
> 
> With this change in mind, we looked for stack traces that also make use of
> this new commit. In a different kernel thread we noticed the use of
> check_heap_object():
> 
> #   Source Location             Function Name           Function Line
> 0   arch/x86/include/asm/paravirt.h:704 arch_local_irq_enable       702
> 1   arch/x86/include/asm/irqflags.h:138 arch_local_irq_restore      135
> 2   kernel/sched/sched.h:1330       raw_spin_rq_unlock_irqrestore   1327
> 3   kernel/sched/sched.h:1327       raw_spin_rq_unlock_irqrestore   N/A
> 4   kernel/sched/sched.h:1611       rq_unlock_irqrestore        1607
> 5   kernel/sched/fair.c:8288        update_blocked_averages     8272
> 6   kernel/sched/fair.c:11133       run_rebalance_domains       11115
> 7   kernel/softirq.c:571            __do_softirq            528
> 8   kernel/softirq.c:445            invoke_softirq          433
> 9   kernel/softirq.c:650            __irq_exit_rcu          640
> 10  arch/x86/kernel/apic/apic.c:1106    sysvec_apic_timer_interrupt N/A
> 11  N/A                 asm_sysvec_apic_timer_interrupt N/A
> 12  include/linux/mmzone.h:1403     __nr_to_section         1395
> 13  include/linux/mmzone.h:1488     __pfn_to_section        1486
> 14  include/linux/mmzone.h:1539     pfn_valid           1524
> 15  arch/x86/mm/physaddr.c:65       __virt_addr_valid       47
> 16  mm/usercopy.c:188           check_heap_object       161
> 17  mm/usercopy.c:250           __check_object_size     212
> 18  mm/usercopy.c:212           __check_object_size     N/A
> 19  include/linux/thread_info.h:199     check_object_size       195
> 20  lib/strncpy_from_user.c:137     strncpy_from_user       113
> 21  fs/namei.c:150              getname_flags           129
> 22  fs/namei.c:2896             user_path_at_empty      2893
> 23  include/linux/namei.h:57        user_path_at            54
> 24  fs/open.c:446               do_faccessat            420
> 25  arch/x86/entry/common.c:50      do_syscall_x64          40
> 26  arch/x86/entry/common.c:80      do_syscall_64           73
> 27  N/A                 entry_SYSCALL_64_after_hwframe  N/A
> 
> We are not experts in the mm subsystem and cannot provide a fix, but we wanted
> to let you know about our findings.
> 
> Cheers,
>  Florian