On Sat, Sep 21, 2024 at 05:30:23AM -0700, syzbot wrote:
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:    a430d95c5efa Merge tag 'lsm-pr-20240911' of git://git.kern..
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=15569d00580000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=d9ab5893ec5191eb
> dashboard link: https://syzkaller.appspot.com/bug?extid=bf0143df12877cb1a21b
> compiler:       Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=109e6a77980000
>
> Downloadable assets:
> disk image: https://storage.googleapis.com/syzbot-assets/f8549392ace4/disk-a430d95c.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/9b866ef0c06c/vmlinux-a430d95c.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/9d6691641029/bzImage-a430d95c.xz
> mounted in repro: https://storage.googleapis.com/syzbot-assets/dd50a033ca89/mount_2.gz
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+bf0143df12877cb1a21b@xxxxxxxxxxxxxxxxxxxxxxxxx
>
> INFO: task syz.0.75:6468 blocked for more than 143 seconds.
>       Not tainted 6.11.0-syzkaller-02574-ga430d95c5efa #0
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:syz.0.75 state:D stack:24864 pid:6468 tgid:6446 ppid:6378 flags:0x00004004
> Call Trace:
>  <TASK>
>  context_switch kernel/sched/core.c:5188 [inline]
>  __schedule+0x17ae/0x4a10 kernel/sched/core.c:6529
>  __schedule_loop kernel/sched/core.c:6606 [inline]
>  schedule+0x14b/0x320 kernel/sched/core.c:6621
>  xfs_ail_push_all_sync+0x236/0x310 fs/xfs/xfs_trans_ail.c:726
>  xfs_log_quiesce+0xdf/0x5b0 fs/xfs/xfs_log.c:1018
>  xfs_fs_freeze+0x8d/0x1a0 fs/xfs/xfs_super.c:936
>  freeze_super+0x81b/0xee0 fs/super.c:2107
>  fs_bdev_freeze+0x1ac/0x320 fs/super.c:1484
>  bdev_freeze+0xd6/0x220 block/bdev.c:257
>  xfs_fs_goingdown+0xa9/0x160 fs/xfs/xfs_fsops.c:446
>  xfs_file_ioctl+0x12d4/0x19e0 fs/xfs/xfs_ioctl.c:1473
>  vfs_ioctl fs/ioctl.c:51 [inline]
>  __do_sys_ioctl fs/ioctl.c:907 [inline]
>  __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
>  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>  do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> RIP: 0033:0x7fc862d7def9
> RSP: 002b:00007fc863b4a038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> RAX: ffffffffffffffda RBX: 00007fc862f36058 RCX: 00007fc862d7def9
> RDX: 0000000020000080 RSI: 000000008004587d RDI: 0000000000000006
> RBP: 00007fc862df0b76 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> R13: 0000000000000000 R14: 00007fc862f36058 R15: 00007ffd1aceae18
>  </TASK>

Not a bug. The reproducer is mounting the filesystem, then some other
thread is immediately resizing the block device underneath the
filesystem to zero length.

In the cases where the test "passes", this happens:

[ 215.851387][ T6299] XFS (loop0): Mounting V5 Filesystem bfdc47fc-10d8-4eed-a562-11a831b3f791
[ 215.895695][ T6299] XFS (loop0): Ending clean mount
[ 215.953257][ T6299] loop0: detected capacity change from 32768 to 0
[ 215.963786][ T6299] syz.0.70: attempt to access beyond end of device
[ 215.963786][ T6299] loop0: rw=432129, sector=768, nr_sectors = 128 limit=0
[ 215.978206][ T45] XFS (loop0): log I/O error -5
[ 215.984613][ T45] XFS (loop0): Filesystem has been shut down due to log error (0x2).
[ 215.992720][ T45] XFS (loop0): Please unmount the filesystem and rectify the problem(s).
[ 216.011609][ T6230] XFS (loop0): Unmounting Filesystem bfdc47fc-10d8-4eed-a562-11a831b3f791o

The loop device change of size happens about 600ms after the mount
finishes, and the next IO that is issued is a log IO. That gets
errored out and the filesystem shuts itself down.

In the case where it has hung:

[ 231.577098][ T6447] loop0: detected capacity change from 0 to 32768
[ 231.590389][ T6447] XFS (loop0): Mounting V5 Filesystem bfdc47fc-10d8-4eed-a562-11a831b3f791
[ 231.642021][ T6447] XFS (loop0): Ending clean mount
[ 231.713195][ T6447] loop0: detected capacity change from 32768 to 0
[ 231.727969][ T6467] xfsaild/loop0: attempt to access beyond end of device
[ 231.727969][ T6467] loop0: rw=4097, sector=2, nr_sectors = 1 limit=0
[ 231.742236][ T6467] xfsaild/loop0: attempt to access beyond end of device
[ 231.742236][ T6467] loop0: rw=4097, sector=12, nr_sectors = 4 limit=0
....

The loop dev changes size about 700ms after the mount, and the log IO
has already completed. Hence the next IOs are from the user driven
shutdown (not sure where that is coming from - nothing in the syzbot
logs, and no C reproducer was provided) freezing the device before
shutting down the filesystem. These error out, but don't cause a
shutdown because the default XFS config for async metadata writeback
failures is "retry forever":

.....
[ 231.849797][ T52] kworker/1:1: attempt to access beyond end of device
[ 231.849797][ T52] loop0: rw=4097, sector=18560, nr_sectors = 64 limit=0
[ 231.933061][ T6467] XFS (loop0): Failing async write on buffer block 0x2. Retrying async write.
[ 231.942187][ T6467] XFS (loop0): Failing async write on buffer block 0x10. Retrying async write.
[ 231.951196][ T6467] XFS (loop0): Failing async write on buffer block 0xc. Retrying async write.
[ 231.960108][ T6467] xfsaild/loop0: attempt to access beyond end of device
[ 231.960108][ T6467] loop0: rw=4097, sector=2, nr_sectors = 1 limit=0
[ 231.973536][ T6467] xfsaild/loop0: attempt to access beyond end of device
[ 231.973536][ T6467] loop0: rw=4097, sector=12, nr_sectors = 4 limit=0
[ 232.053094][ T6467] XFS (loop0): Failing async write on buffer block 0x2. Retrying async write.
[ 232.061978][ T6467] XFS (loop0): Failing async write on buffer block 0x10. Retrying async write.
[ 232.070960][ T6467] XFS (loop0): Failing async write on buffer block 0xc. Retrying async write.
[ 232.143152][ T6467] XFS (loop0): Failing async write on buffer block 0x2. Retrying async write.
[ 232.152027][ T6467] XFS (loop0): Failing async write on buffer block 0x10. Retrying async write.
.....

And so the user driven shutdown hangs at this point waiting for
metadata to flush. This is expected behaviour. If the user wants
metadata writes to fail immediately in this situation, then they need
to do:

# echo 1 > /sys/fs/xfs/<dev>/error/metadata/EIO/max_retries

So on IO error the writeback code will only retry once before failing
and shutting down the filesystem immediately. (Both the failure
sequence and these error knobs are sketched below the sig.)

So, no bug here, just syzbot not understanding how metadata IO errors
are handled by different filesystems. Hence:

#syz invalid

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
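
P.S. for anyone who wants to poke at this by hand, something like the
following rough sketch recreates the failure window. The image path,
size, mount point and the losetup-based resize are stand-ins of mine -
with no C reproducer provided, this only approximates what the syz
program does:

# Build a throwaway loop-backed XFS filesystem and mount it.
truncate -s 512M /tmp/xfs.img
mkfs.xfs -q /tmp/xfs.img
loopdev=$(losetup -f --show /tmp/xfs.img)
mkdir -p /mnt/scratch
mount "$loopdev" /mnt/scratch

# Shrink the backing file to nothing and make the loop driver re-read
# its size - this is what produces the "detected capacity change
# ... to 0" messages quoted above.
truncate -s 0 /tmp/xfs.img
losetup -c "$loopdev"

# From here every metadata IO fails with "attempt to access beyond
# end of device". Whether the filesystem shuts down straight away or
# retries forever depends on whether the log IO or only async buffer
# writeback hits the error, as described above.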
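
The error configuration knobs referenced above live under
/sys/fs/xfs/<dev>/error/. A minimal sketch of using them, assuming the
filesystem sits on loop0 (substitute the real device name) and using
the knob names from the XFS admin guide:

# Current policy for async metadata write EIO errors; a max_retries
# of -1 means retry forever.
cat /sys/fs/xfs/loop0/error/metadata/EIO/max_retries
cat /sys/fs/xfs/loop0/error/metadata/EIO/retry_timeout_seconds

# Give up after a single retry so a dead device shuts the filesystem
# down instead of leaving a freeze stuck waiting on writeback.
echo 1 > /sys/fs/xfs/loop0/error/metadata/EIO/max_retries

# Pending retries can also be cancelled when the filesystem is
# unmounted.
cat /sys/fs/xfs/loop0/error/fail_at_unmount

The same pair of knobs exists for the other metadata error classes
documented in the admin guide, so the policy can be tuned per error
type.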