On Sun, Oct 13, 2024 at 10:22:11PM -0400, Liam R. Howlett wrote:
> * Sasha Levin <sashal@xxxxxxxxxx> [241013 09:29]:
> > On Thu, Oct 10, 2024 at 04:28:18PM +0100, Lorenzo Stoakes wrote:
> > > On Thu, Oct 10, 2024 at 08:19:28AM -0700, syzbot wrote:
> > > > Hello,
> > > >
> > > > syzbot found the following issue on:
> > > >
> > > > HEAD commit: d3d1556696c1 Merge tag 'mm-hotfixes-stable-2024-10-09-15-4..
> > > > git tree: upstream
> > > > console output: https://syzkaller.appspot.com/x/log.txt?x=10416fd0580000
> > > > kernel config: https://syzkaller.appspot.com/x/.config?x=7a3fccdd0bb995
> > > > dashboard link: https://syzkaller.appspot.com/bug?extid=39bc767144c55c8db0ea
> > > > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
> > > >
> > > > Unfortunately, I don't have any reproducer for this issue yet.
> > > >
> > > > Downloadable assets:
> > > > disk image: https://storage.googleapis.com/syzbot-assets/0600b551e610/disk-d3d15566.raw.xz
> > > > vmlinux: https://storage.googleapis.com/syzbot-assets/d59d43ed3976/vmlinux-d3d15566.xz
> > > > kernel image: https://storage.googleapis.com/syzbot-assets/e686a3e7e0d6/bzImage-d3d15566.xz
> > > >
> > > > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > > > Reported-by: syzbot+39bc767144c55c8db0ea@xxxxxxxxxxxxxxxxxxxxxxxxx
> > > >
> > > > INFO: task syz.3.917:7739 blocked for more than 146 seconds.
> > > > Not tainted 6.12.0-rc2-syzkaller-00074-gd3d1556696c1 #0
> > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > task:syz.3.917 state:D stack:23808 pid:7739 tgid:7739 ppid:5232 flags:0x00004000
> > > > Call Trace:
> > > > <TASK>
> > > > context_switch kernel/sched/core.c:5322 [inline]
> > > > __schedule+0x1843/0x4ae0 kernel/sched/core.c:6682
> > > > __schedule_loop kernel/sched/core.c:6759 [inline]
> > > > schedule+0x14b/0x320 kernel/sched/core.c:6774
> > > > schedule_preempt_disabled+0x13/0x30 kernel/sched/core.c:6831
> > > > rwsem_down_write_slowpath+0xeee/0x13b0 kernel/locking/rwsem.c:1176
> > > > __down_write_common kernel/locking/rwsem.c:1304 [inline]
> > > > __down_write kernel/locking/rwsem.c:1313 [inline]
> > > > down_write+0x1d7/0x220 kernel/locking/rwsem.c:1578
> > > > mmap_write_lock include/linux/mmap_lock.h:106 [inline]
> > > > exit_mmap+0x2bd/0xc40 mm/mmap.c:1872
> > >
> > > Hmm, task freezing up or system becoming unstable/locked up is reminiscent
> > > of the maple tree bug I fixed in [0], which is still in the unstable hotfix
> > > branch.
> > >
> > > This is likely not going to repro as it's quite heisenbug-ish to trigger
> > > and the failures are like this - somewhat disconnected from the cause, so
> > > not sure if there is any case to speed this to Linus's tree.
> > >
> > > On the other hand it's a pretty serious problem for stability and likely to
> > > continue to manifest in nasty ways like this.
> > >
> > > Can't be 100% sure this is the cause, but seems likely.
> > >
> > > [0]: https://lore.kernel.org/linux-mm/48b349a2a0f7c76e18772712d0997a5e12ab0a3b.1728314403.git.lorenzo.stoakes@xxxxxxxxxx/
> >
> > On my Debian build box, running a 6.1 kernel, I've started hitting a
> > similar issue:
> >
> > Oct 12 17:24:01 debian kernel: INFO: task sed:3557356 blocked for more than 1208 seconds.
> > Oct 12 17:24:01 debian kernel: Not tainted 6.1.0-26-amd64 #1 Debian 6.1.112-1
> > Oct 12 17:24:01 debian kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Oct 12 17:24:01 debian kernel: task:sed state:D stack:0 pid:3557356 ppid:1 flags:0x00000002
> > Oct 12 17:24:01 debian kernel: Call Trace:
> > Oct 12 17:24:01 debian kernel: <TASK>
> > Oct 12 17:24:01 debian kernel: __schedule+0x34d/0x9e0
> > Oct 12 17:24:01 debian kernel: schedule+0x5a/0xd0
> > Oct 12 17:24:01 debian kernel: rwsem_down_write_slowpath+0x311/0x6d0
> > Oct 12 17:24:01 debian kernel: exit_mmap+0xf6/0x2f0
> > Oct 12 17:24:01 debian kernel: __mmput+0x3e/0x130
> > Oct 12 17:24:01 debian kernel: do_exit+0x2fc/0xaf0
> > Oct 12 17:24:01 debian kernel: do_group_exit+0x2d/0x80
> > Oct 12 17:24:01 debian kernel: __x64_sys_exit_group+0x14/0x20
> > Oct 12 17:24:01 debian kernel: do_syscall_64+0x55/0xb0
> > Oct 12 17:24:01 debian kernel: ? do_fault+0x1a4/0x410
> > Oct 12 17:24:01 debian kernel: ? __handle_mm_fault+0x660/0xfa0
> > Oct 12 17:24:01 debian kernel: ? exit_to_user_mode_prepare+0x40/0x1e0
> > Oct 12 17:24:01 debian kernel: ? handle_mm_fault+0xdb/0x2d0
> > Oct 12 17:24:01 debian kernel: ? do_user_addr_fault+0x1b0/0x550
> > Oct 12 17:24:01 debian kernel: ? exit_to_user_mode_prepare+0x40/0x1e0
> > Oct 12 17:24:01 debian kernel: entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > Oct 12 17:24:01 debian kernel: RIP: 0033:0x7f797d75a349
> > Oct 12 17:24:01 debian kernel: RSP: 002b:00007fff37f0d3c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> > Oct 12 17:24:01 debian kernel: RAX: ffffffffffffffda RBX: 00007f797d8549e0 RCX: 00007f797d75a349
> > Oct 12 17:24:01 debian kernel: RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
> > Oct 12 17:24:01 debian kernel: RBP: 0000000000000000 R08: fffffffffffffe98 R09: 00007fff37f0d2df
> > Oct 12 17:24:01 debian kernel: R10: 00007fff37f0d240 R11: 0000000000000246 R12: 00007f797d8549e0
> > Oct 12 17:24:01 debian kernel: R13: 00007f797d85a2e0 R14: 0000000000000002 R15: 00007f797d85a2c8
> > Oct 12 17:24:01 debian kernel: </TASK>
> >
> > It reproduces fairly easily during a kernel build...
> >
> > It doesn't sound like the same issue you're pointing out, right Lorenzo?
>
> It could be. I suspect there has been a change recently that has
> made the bug possible - although, I've not put effort into finding out
> yet if that is true. If the bug existed for a long time (probably since
> I fixed the live locking issue in 6.4 that was backported), then you
> could be hitting it.
>
> It is a single line fix. If it happens frequently enough, you could try
> it - this fix will go through the backporting route once it lands.
>
> Although, I am not sure it has much to do with the maple tree as I don't
> think anyone should have the mm to take the mmap write lock. If we were
> stuck in the maple tree somehow, the mm wouldn't reach the exit_mmap()
> path - unless I missed something?

I think this is the same bug. The problem is that, once it manifests, the
actual problems it causes are delayed down the line until we hit a
situation that is _caused by_ the bug but somewhat detached from it.

In fact, Bert and Mikhail hit the same thing: a process locking up in this
path AND a lock being held that really makes no sense.

As Liam says, this is a one-line change [0]; could you try taking it and
seeing whether it helps? It has already been taken by Andrew, but we delay
our hotfixes by a week or two, so it isn't in -rc yet.

[0]: https://lore.kernel.org/linux-mm/48b349a2a0f7c76e18772712d0997a5e12ab0a3b.1728314403.git.lorenzo.stoakes@xxxxxxxxxx/

>
> If you can dump the running tasks when you hit it, we could get a clue
> from the (probably numerous) backtraces?

I suggest first trying [0] :) If that doesn't fix things, then we will
definitely want to dig in further.

>
> Thanks,
> Liam

Thanks!