Re: [syzbot] INFO: task hung in ext4_fallocate

Andres Freund <andres@xxxxxxxxxxx> · Tue, 3 Oct 2023 07:11:38 -0700

Hi,

On 2023-09-18 18:07:54 -0400, Theodore Ts'o wrote:
> On Mon, Sep 18, 2023 at 08:13:30PM +0530, Shreeya Patel wrote:
> > When I tried to reproduce this issue on mainline linux kernel using the
> > reproducer provided by syzbot, I see an endless loop of following errors :-
> >
> > [   89.689751][ T3922] loop1: detected capacity change from 0 to 512
> > [   89.690514][ T3916] EXT4-fs error (device loop4): ext4_map_blocks:577:
> > inode #3: block 9: comm poc: lblock 0 mapped to illegal pblock 9 (length 1)
> > [   89.694709][ T3890] EXT4-fs error (device loop0): ext4_map_blocks:577:
>
> Please note that maliciously corrupted file system is considered low
> priority by ext4 developers.

FWIW, I am seeing occasional hangs in ext4_fallocate with 6.6-rc4 as well,
just doing database development on my laptop.

Oct 03 06:48:30 alap5 kernel: INFO: task postgres:132485 blocked for more than 122 seconds.
Oct 03 06:48:30 alap5 kernel:       Not tainted 6.6.0-rc4-andres-00009-ge38f6d0c8360 #74
Oct 03 06:48:30 alap5 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 03 06:48:30 alap5 kernel: task:postgres        state:D stack:0     pid:132485 ppid:132017 flags:0x00020006
Oct 03 06:48:30 alap5 kernel: Call Trace:
Oct 03 06:48:30 alap5 kernel:  <TASK>
Oct 03 06:48:30 alap5 kernel:  __schedule+0x388/0x13e0
Oct 03 06:48:30 alap5 kernel:  ? nvme_queue_rqs+0x1e6/0x290
Oct 03 06:48:30 alap5 kernel:  schedule+0x5f/0xe0
Oct 03 06:48:30 alap5 kernel:  inode_dio_wait+0xd5/0x100
Oct 03 06:48:30 alap5 kernel:  ? membarrier_register_private_expedited+0xa0/0xa0
Oct 03 06:48:30 alap5 kernel:  ext4_fallocate+0x12f/0x1040
Oct 03 06:48:30 alap5 kernel:  ? io_submit_sqes+0x392/0x660
Oct 03 06:48:30 alap5 kernel:  vfs_fallocate+0x135/0x360
Oct 03 06:48:30 alap5 kernel:  __x64_sys_fallocate+0x42/0x70
Oct 03 06:48:30 alap5 kernel:  do_syscall_64+0x38/0x80
Oct 03 06:48:30 alap5 kernel:  entry_SYSCALL_64_after_hwframe+0x46/0xb0
Oct 03 06:48:30 alap5 kernel: RIP: 0033:0x7fda32415f82
Oct 03 06:48:30 alap5 kernel: RSP: 002b:00007ffedbea0828 EFLAGS: 00000246 ORIG_RAX: 000000000000011d
Oct 03 06:48:30 alap5 kernel: RAX: ffffffffffffffda RBX: 00000000002a6000 RCX: 00007fda32415f82
Oct 03 06:48:30 alap5 kernel: RDX: 0000000014b22000 RSI: 0000000000000000 RDI: 0000000000000016
Oct 03 06:48:30 alap5 kernel: RBP: 0000000014b22000 R08: 0000000014b22000 R09: 0000558f70eabed0
Oct 03 06:48:30 alap5 kernel: R10: 00000000002a6000 R11: 0000000000000246 R12: 00000000000000e0
Oct 03 06:48:30 alap5 kernel: R13: 000000000a000013 R14: 0000000000000004 R15: 0000558f709d4e70
Oct 03 06:48:30 alap5 kernel:  </TASK>

cat /proc/132485/stack
[<0>] inode_dio_wait+0xd5/0x100
[<0>] ext4_fallocate+0x12f/0x1040
[<0>] vfs_fallocate+0x135/0x360
[<0>] __x64_sys_fallocate+0x42/0x70
[<0>] do_syscall_64+0x38/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0

cat /sys/kernel/debug/block/nvme1n1/hctx*/busy
doesn't show any active IO. Nor does the issue resolve when resetting the
controller - so I don't think this is a disk/block level issue.

This is on 8f1b4600373f, with one local, unrelated, commit ontop.

Unfortunately, so far, I've only reproduced this every couple hours of
interactive work, so bisecting isn't really feasible. I've not hit it on 6.5
over a considerably longer time, so I am reasonably confident this isn't an
older issue.

Greetings,

Andres Freund