On 10/23/23 7:12 PM, Dave Chinner wrote:
> [cc Jens, io-uring]

Thanks!

> On Tue, Oct 17, 2023 at 07:50:09PM -0700, Andres Freund wrote:
>> Hi,
>>
>> On 2023-10-17 20:43:35 -0400, Theodore Ts'o wrote:
>>> On Mon, Oct 16, 2023 at 08:37:25PM -0700, Andres Freund wrote:
>>>> I just was able to reproduce the issue, after upgrading to 6.6-rc6 - this time
>>>> it took ~55min of high load (io_uring using branch of postgres, running a
>>>> write heavy transactional workload concurrently with concurrent bulk data
>>>> load) to trigger the issue.
>>>>
>>>> For now I have left the system running, in case there's something you would
>>>> like me to check while the system is hung.
>>>>
>>>> The first hanging task that I observed:
>>>>
>>>> cat /proc/57606/stack
>>>> [<0>] inode_dio_wait+0xd5/0x100
>>>> [<0>] ext4_fallocate+0x12f/0x1040
>>>> [<0>] vfs_fallocate+0x135/0x360
>>>> [<0>] __x64_sys_fallocate+0x42/0x70
>>>> [<0>] do_syscall_64+0x38/0x80
>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>>
>>> This stack trace is from some process (presumably postgres) trying to
>>> do a fallocate() system call:
>>
>> Yes, it's indeed postgres.
>>
>>>> [ 3194.579297] INFO: task iou-wrk-58004:58874 blocked for more than 122 seconds.
>>>> [ 3194.579304] Not tainted 6.6.0-rc6-andres-00001-g01edcfe38260 #77
>>>> [ 3194.579310] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> [ 3194.579314] task:iou-wrk-58004 state:D stack:0 pid:58874 ppid:52606 flags:0x00004000
>>>> [ 3194.579325] Call Trace:
>>>> [ 3194.579329] <TASK>
>>>> [ 3194.579334] __schedule+0x388/0x13e0
>>>> [ 3194.579349] schedule+0x5f/0xe0
>>>> [ 3194.579361] schedule_preempt_disabled+0x15/0x20
>>>> [ 3194.579374] rwsem_down_read_slowpath+0x26e/0x4c0
>>>> [ 3194.579385] down_read+0x44/0xa0
>>>> [ 3194.579393] ext4_file_write_iter+0x432/0xa80
>>>> [ 3194.579407] io_write+0x129/0x420
>>>
>>> This could potentially be an interesting stack trace; but this is where
>>> we really need to map the stack address to line numbers. Is that
>>> something you could do?
>>
>> Converting the above with scripts/decode_stacktrace.sh yields:
>>
>> __schedule (kernel/sched/core.c:5382 kernel/sched/core.c:6695)
>> schedule (./arch/x86/include/asm/preempt.h:85 (discriminator 13) kernel/sched/core.c:6772 (discriminator 13))
>> schedule_preempt_disabled (./arch/x86/include/asm/preempt.h:80 (discriminator 10) kernel/sched/core.c:6831 (discriminator 10))
>> rwsem_down_read_slowpath (kernel/locking/rwsem.c:1073 (discriminator 4))
>> down_read (./arch/x86/include/asm/preempt.h:95 kernel/locking/rwsem.c:1257 kernel/locking/rwsem.c:1263 kernel/locking/rwsem.c:1522)
>> ext4_file_write_iter (./include/linux/fs.h:1073 fs/ext4/file.c:57 fs/ext4/file.c:564 fs/ext4/file.c:715)
>> io_write (./include/linux/fs.h:1956 io_uring/rw.c:926)
>>
>> But I'm not sure that that stacktrace is quite right (e.g. what's
>> ./include/linux/fs.h:1073 doing in this stacktrace?). But with an optimized
>> build it's a bit hard to tell what's an optimization artifact and what's an
>> off stack trace...
>>
>> I had the idea to look at the stacks (via /proc/$pid/stack) for all postgres
>> processes and the associated io-uring threads, and then to deduplicate them.
>>
>> 22x:
>> ext4_file_write_iter (./include/linux/fs.h:1073 fs/ext4/file.c:57 fs/ext4/file.c:564 fs/ext4/file.c:715)
>> io_write (./include/linux/fs.h:1956 io_uring/rw.c:926)
>> io_issue_sqe (io_uring/io_uring.c:1878)
>> io_wq_submit_work (io_uring/io_uring.c:1960)
>> io_worker_handle_work (io_uring/io-wq.c:596)
>> io_wq_worker (io_uring/io-wq.c:258 io_uring/io-wq.c:648)
>> ret_from_fork (arch/x86/kernel/process.c:147)
>> ret_from_fork_asm (arch/x86/entry/entry_64.S:312)
>
> io_uring does some interesting stuff with IO completion and iomap
> now - IIRC IOCB_DIO_CALLER_COMP is new 6.6-rc1 functionality. This
> flag is set by io_write() to tell the iomap IO completion handler
> that the caller will run the IO completion, avoiding a context switch
> to run the completion in a kworker thread.
>
> It's not until the caller runs iocb->dio_complete() that
> inode_dio_end() is called to drop the i_dio_count. This happens when
> io_uring gets completion notification from iomap via
> iocb->ki_complete(iocb, 0);
>
> It then requires the io_uring layer to process the completion
> and call iocb->dio_complete() before the inode->i_dio_count is
> dropped and inode_dio_wait() will complete.
>
> So what I suspect here is that we have a situation where the worker
> that would run the completion is blocked on the inode rwsem because
> this isn't an IOCB_NOWAIT IO and the fallocate call holds the rwsem
> exclusive and is waiting on inode_dio_wait() to return.
>
> Cc Jens, because I'm not sure this is actually an ext4 problem - I
> can't see why XFS won't have the same issue w.r.t.
> truncate/fallocate and IOCB_DIO_CALLER_COMP delaying the
> inode_dio_end() until whenever the io_uring code can get around to
> processing the delayed completion....
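To make the suspected interaction concrete, the I/O pattern being described
amounts to something like the sketch below: a direct-IO write submitted via
io_uring (whose completion, and hence inode_dio_end(), may be deferred to the
submitting task), followed by an fallocate() on the same file from that same
task. This is only an illustration of the pattern, assuming liburing is
available; the file name and sizes are arbitrary, and it is not a confirmed
reproducer - whether it actually hangs depends on timing and on when the
deferred completion gets processed.

/*
 * Sketch of the pattern discussed above: queue a direct-IO write via
 * io_uring, then call fallocate() on the same file from the submitting
 * task before the write completion has been reaped.
 *
 * Build (assuming liburing): gcc -O2 -o dio-fallocate dio-fallocate.c -luring
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    void *buf;
    int fd, ret;

    fd = open("testfile", O_RDWR | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (posix_memalign(&buf, 4096, 4096))
        return 1;
    memset(buf, 0, 4096);

    ret = io_uring_queue_init(8, &ring, 0);
    if (ret < 0) {
        fprintf(stderr, "queue_init: %d\n", ret);
        return 1;
    }

    /* Queue and submit a 4k direct write at offset 0. */
    sqe = io_uring_get_sqe(&ring);
    if (!sqe)
        return 1;
    io_uring_prep_write(sqe, fd, buf, 4096, 0);
    io_uring_submit(&ring);

    /*
     * While the write may still be in flight, extend the file with
     * fallocate(). Per the analysis above, ext4_fallocate() takes the
     * inode rwsem exclusively and then waits in inode_dio_wait(),
     * which only returns once the write's completion has dropped
     * i_dio_count.
     */
    if (fallocate(fd, 0, 0, 1 << 20) < 0)
        perror("fallocate");

    /* Reap the write completion. */
    if (io_uring_wait_cqe(&ring, &cqe) == 0)
        io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(fd);
    free(buf);
    return 0;
}

In the hang reported in this thread, the submitting task never gets to run
that deferred completion, because it is itself blocked inside fallocate().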
I don't think this is related to the io-wq workers doing non-blocking
IO. The callback is eventually executed by the task that originally
submitted the IO, which is the owner and not the async workers. But...
if that original task is blocked in e.g. fallocate, then I can see how
that would potentially be an issue. I'll take a closer look.

-- 
Jens Axboe
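As a side note, the stack deduplication step described earlier (reading
/proc/$pid/stack for each postgres process and io_uring worker thread, then
collapsing identical traces) can be approximated with a small helper along
the lines of the sketch below. This is only a sketch: the pid list is passed
on the command line rather than discovered automatically, and reading
/proc/<pid>/stack requires root.

/*
 * Rough sketch: print each unique kernel stack among the given pids,
 * with a count of how many pids share it.
 *
 * Usage: ./stackdedup <pid> [<pid> ...]   (run as root)
 */
#include <stdio.h>
#include <string.h>

#define MAX_STACKS 1024
#define STACK_BUF  8192

static char stacks[MAX_STACKS][STACK_BUF];
static int counts[MAX_STACKS];

int main(int argc, char **argv)
{
    int nstacks = 0;

    for (int i = 1; i < argc; i++) {
        char path[64];
        char buf[STACK_BUF];
        FILE *f;
        size_t n;

        snprintf(path, sizeof(path), "/proc/%s/stack", argv[i]);
        f = fopen(path, "r");
        if (!f) {
            perror(path);
            continue;
        }
        n = fread(buf, 1, sizeof(buf) - 1, f);
        buf[n] = '\0';
        fclose(f);

        /* Look for an identical stack we have already seen. */
        int j;
        for (j = 0; j < nstacks; j++)
            if (strcmp(stacks[j], buf) == 0)
                break;
        if (j == nstacks && nstacks < MAX_STACKS)
            strcpy(stacks[nstacks++], buf);
        if (j < MAX_STACKS)
            counts[j]++;
    }

    for (int j = 0; j < nstacks; j++)
        printf("%dx:\n%s\n", counts[j], stacks[j]);

    return 0;
}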