Hi,

On 2023-10-17 20:43:35 -0400, Theodore Ts'o wrote:
> On Mon, Oct 16, 2023 at 08:37:25PM -0700, Andres Freund wrote:
> > I just was able to reproduce the issue, after upgrading to 6.6-rc6 - this
> > time it took ~55min of high load (io_uring using branch of postgres,
> > running a write heavy transactional workload concurrently with concurrent
> > bulk data load) to trigger the issue.
> >
> > For now I have left the system running, in case there's something you
> > would like me to check while the system is hung.
> >
> > The first hanging task that I observed:
> >
> > cat /proc/57606/stack
> > [<0>] inode_dio_wait+0xd5/0x100
> > [<0>] ext4_fallocate+0x12f/0x1040
> > [<0>] vfs_fallocate+0x135/0x360
> > [<0>] __x64_sys_fallocate+0x42/0x70
> > [<0>] do_syscall_64+0x38/0x80
> > [<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
>
> This stack trace is from some process (presumably postgres) trying to
> do a fallocate() system call:

Yes, it's indeed postgres.

> > [ 3194.579297] INFO: task iou-wrk-58004:58874 blocked for more than 122 seconds.
> > [ 3194.579304]       Not tainted 6.6.0-rc6-andres-00001-g01edcfe38260 #77
> > [ 3194.579310] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [ 3194.579314] task:iou-wrk-58004   state:D stack:0     pid:58874 ppid:52606  flags:0x00004000
> > [ 3194.579325] Call Trace:
> > [ 3194.579329]  <TASK>
> > [ 3194.579334]  __schedule+0x388/0x13e0
> > [ 3194.579349]  schedule+0x5f/0xe0
> > [ 3194.579361]  schedule_preempt_disabled+0x15/0x20
> > [ 3194.579374]  rwsem_down_read_slowpath+0x26e/0x4c0
> > [ 3194.579385]  down_read+0x44/0xa0
> > [ 3194.579393]  ext4_file_write_iter+0x432/0xa80
> > [ 3194.579407]  io_write+0x129/0x420
>
> This could potentially be an interesting stack trace; but this is where
> we really need to map the stack addresses to line numbers. Is that
> something you could do?
Converting the above with scripts/decode_stacktrace.sh yields:

__schedule (kernel/sched/core.c:5382 kernel/sched/core.c:6695)
schedule (./arch/x86/include/asm/preempt.h:85 (discriminator 13) kernel/sched/core.c:6772 (discriminator 13))
schedule_preempt_disabled (./arch/x86/include/asm/preempt.h:80 (discriminator 10) kernel/sched/core.c:6831 (discriminator 10))
rwsem_down_read_slowpath (kernel/locking/rwsem.c:1073 (discriminator 4))
down_read (./arch/x86/include/asm/preempt.h:95 kernel/locking/rwsem.c:1257 kernel/locking/rwsem.c:1263 kernel/locking/rwsem.c:1522)
ext4_file_write_iter (./include/linux/fs.h:1073 fs/ext4/file.c:57 fs/ext4/file.c:564 fs/ext4/file.c:715)
io_write (./include/linux/fs.h:1956 io_uring/rw.c:926)

But I'm not sure that that stacktrace is quite right (e.g. what's
./include/linux/fs.h:1073 doing in this stacktrace?). But with an optimized
build it's a bit hard to tell what's an optimization artifact and what's an
off stack trace...

I had the idea to look at the stacks (via /proc/$pid/stack) for all postgres
processes and the associated io-uring threads, and then to deduplicate them.
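The collection and dedup step can be sketched roughly like this (a hypothetical helper, assuming the processes are found by name via pgrep; the io_uring workers are threads of the process, so they show up as tids under /proc/<pid>/task/, and reading /proc/<pid>/stack generally requires root):

```shell
#!/bin/sh
# Dump the kernel stack of every postgres task, including io_uring
# worker threads (separate tids under /proc/<pid>/task/), collapse
# each multi-line stack to a single line, and count duplicates.
for pid in $(pgrep postgres); do
    for task in /proc/"$pid"/task/*/; do
        # Flatten the stack so identical stacks compare equal under uniq.
        tr '\n' ' ' < "${task}stack" && echo
    done
done | sort | uniq -c | sort -rn
```

Sorting by count puts the most common stack first, which is what makes the pile-up on one lock stand out. That gave: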
22x:
ext4_file_write_iter (./include/linux/fs.h:1073 fs/ext4/file.c:57 fs/ext4/file.c:564 fs/ext4/file.c:715)
io_write (./include/linux/fs.h:1956 io_uring/rw.c:926)
io_issue_sqe (io_uring/io_uring.c:1878)
io_wq_submit_work (io_uring/io_uring.c:1960)
io_worker_handle_work (io_uring/io-wq.c:596)
io_wq_worker (io_uring/io-wq.c:258 io_uring/io-wq.c:648)
ret_from_fork (arch/x86/kernel/process.c:147)
ret_from_fork_asm (arch/x86/entry/entry_64.S:312)

8x:
__do_sys_io_uring_enter (io_uring/io_uring.c:2553 (discriminator 1) io_uring/io_uring.c:2622 (discriminator 1) io_uring/io_uring.c:3706 (discriminator 1))
do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)

2x:
io_wq_worker (./arch/x86/include/asm/current.h:41 (discriminator 5) io_uring/io-wq.c:668 (discriminator 5))
ret_from_fork (arch/x86/kernel/process.c:147)
ret_from_fork_asm (arch/x86/entry/entry_64.S:312)

2x:
futex_wait_queue (./arch/x86/include/asm/current.h:41 (discriminator 5) kernel/futex/waitwake.c:357 (discriminator 5))
futex_wait (kernel/futex/waitwake.c:660)
do_futex (kernel/futex/syscalls.c:106 (discriminator 1))
__x64_sys_futex (kernel/futex/syscalls.c:183 kernel/futex/syscalls.c:164 kernel/futex/syscalls.c:164)
do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)

1x:
do_epoll_wait (fs/eventpoll.c:1921 (discriminator 1) fs/eventpoll.c:2318 (discriminator 1))
__x64_sys_epoll_wait (fs/eventpoll.c:2325)
do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)

1x:
inode_dio_wait (./arch/x86/include/asm/atomic.h:23 ./include/linux/atomic/atomic-arch-fallback.h:444 ./include/linux/atomic/atomic-instrumented.h:33 fs/inode.c:2429 fs/inode.c:2446)
ext4_fallocate (fs/ext4/extents.c:4752)
vfs_fallocate (fs/open.c:324)
__x64_sys_fallocate (./include/linux/file.h:45 fs/open.c:348 fs/open.c:355 fs/open.c:353 fs/open.c:353)
do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)

> > Once I hear that you don't want me to test something out on the running
> > system, I think a sensible next step could be to compile with lockdep and
> > see if that finds a problem?
>
> That's certainly a possibility. But also please make sure that you can
> compile with debugging information enabled so that we can get
> reliable line numbers. I use:
>
> CONFIG_DEBUG_INFO=y
> CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
> CONFIG_DEBUG_INFO_REDUCED=y

The kernel from above had debug info enabled, albeit forced to dwarf4 (IIRC I
ran into issues with pahole not understanding all of dwarf5):

$ zgrep DEBUG_INFO /proc/config.gz
CONFIG_DEBUG_INFO=y
# CONFIG_DEBUG_INFO_NONE is not set
# CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT is not set
CONFIG_DEBUG_INFO_DWARF4=y
# CONFIG_DEBUG_INFO_DWARF5 is not set
# CONFIG_DEBUG_INFO_REDUCED is not set
CONFIG_DEBUG_INFO_COMPRESSED_NONE=y
# CONFIG_DEBUG_INFO_COMPRESSED_ZLIB is not set
# CONFIG_DEBUG_INFO_COMPRESSED_ZSTD is not set
# CONFIG_DEBUG_INFO_SPLIT is not set
CONFIG_DEBUG_INFO_BTF=y
CONFIG_DEBUG_INFO_BTF_MODULES=y

I switched it to CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y and am rebuilding
with lockdep enabled. Curious to see how long it'll take to trigger the
problem with it enabled...

Greetings,

Andres Freund