On 11/3/24 4:47 PM, Andrew Marshall wrote:
> Hi,
>
> I, and others (see downstream report below), are encountering io_uring
> at times hanging on 6.6.59 LTS. If the process is killed, the process
> remains stuck in sleep uninterruptible ("D"). This failure can be
> fairly reliably reproduced via Node.js with `npm ci` in at least some
> projects; disabling that tool's use of io_uring via its configuration
> causes it to succeed. I have identified what seems to be the
> problematic commit on linux-6.6.y (f4ce3b5).
>
> Summary of Kernel version triaging:
>
> - 6.6.56: succeeds
> - 6.6.57: fails
> - 6.6.58: fails
> - 6.6.59: fails
> - 6.6.59 (with f4ce3b5 reverted): succeeds
> - 6.11.6: succeeds
>
> System logs upon failure indicate hung task:
>
> kernel: INFO: task npm ci:47920 blocked for more than 245 seconds.
> kernel:       Tainted: P O 6.6.58 #1-NixOS
> kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kernel: task:npm ci state:D stack:0 pid:47920 ppid:47710 flags:0x00004006
> kernel: Call Trace:
> kernel:  <TASK>
> kernel:  __schedule+0x3fc/0x1430
> kernel:  ? sysvec_apic_timer_interrupt+0xe/0x90
> kernel:  schedule+0x5e/0xe0
> kernel:  schedule_preempt_disabled+0x15/0x30
> kernel:  __mutex_lock.constprop.0+0x3a2/0x6b0
> kernel:  io_uring_del_tctx_node+0x61/0xf0
> kernel:  io_uring_clean_tctx+0x5c/0xc0
> kernel:  io_uring_cancel_generic+0x198/0x350
> kernel:  ? srso_return_thunk+0x5/0x5f
> kernel:  ? timerqueue_del+0x2e/0x50
> kernel:  ? __pfx_autoremove_wake_function+0x10/0x10
> kernel:  do_exit+0x167/0xad0
> kernel:  ? __pfx_hrtimer_wakeup+0x10/0x10
> kernel:  do_group_exit+0x31/0x80
> kernel:  get_signal+0xa60/0xa60
> kernel:  arch_do_signal_or_restart+0x3e/0x280
> kernel:  exit_to_user_mode_prepare+0x1d4/0x230
> kernel:  syscall_exit_to_user_mode+0x1b/0x50
> kernel:  do_syscall_64+0x45/0x90
> kernel:  entry_SYSCALL_64_after_hwframe+0x78/0xe2
>
> For more details, see the downstream bug report in Node.js:
> https://github.com/nodejs/node/issues/55587
>
> I identified f4ce3b5d26ce149e77e6b8e8f2058aa80e5b034e as the likely
> problematic commit simply by browsing git log. As indicated above,
> reverting that atop 6.6.59 results in success. Since it is passing on
> 6.11.6, I suspect there is some missing backport to 6.6.x, or some
> other semantic merge conflict. Unfortunately I do not have a compact,
> minimal reproducer, but can provide my large one (it is testing a
> larger build process in a VM) if needed. There are some additional
> details in the above-linked downstream bug report, though. I hope that
> having identified the problematic commit is enough for someone with
> more context to go off of. Happy to provide more information if
> needed.

Don't worry about not having a reproducer; having the backport commit
pinpointed will do just fine. I'll take a look at this.

-- 
Jens Axboe