Re: [syzbot] [xfs?] INFO: task hung in xfs_ail_push_all_sync (2)

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 22 Oct 2024 09:44:43 +1100

On Fri, Oct 18, 2024 at 12:13:33PM +0200, Aleksandr Nogikh wrote:
> Hi Dave,
> 
> On Thu, Oct 17, 2024 at 2:53 AM 'Dave Chinner' via syzkaller-bugs
> <syzkaller-bugs@xxxxxxxxxxxxxxxx> wrote:
> >
> > On Wed, Oct 16, 2024 at 04:22:27PM -0700, syzbot wrote:
> > > Hello,
> > >
> > > syzbot found the following issue on:
> > >
> > > HEAD commit:    09f6b0c8904b Merge tag 'linux_kselftest-fixes-6.12-rc3' of..
> > > git tree:       upstream
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=14af3fd0580000
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=7cd9e7e4a8a0a15b
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=611be8174be36ca5dbc9
> > > compiler:       Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
> > > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=16c7705f980000
> > > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=14d2fb27980000
> >
> 
> It's better to just leave the issue open until syzbot actually stops
> triggering it. Otherwise, after every "#syz invalid", the crash will
> be eventually seen again and re-sent to the mailing lists.
> 
> In the other email you mentioned
> "/sys/fs/xfs/<dev>/error/metadata/EIO/max_retries" as the only way to
> prevent this hang. Must max_retries be set every time after xfs is
> mounted? Or is it possible to somehow preconfigure it once at VM boot
> and then no longer worry about it during fuzzing?

It's a post-mount config that has to be set after every mount, because
the filesystem has to be mounted before the error config files show up
in /sys/fs/xfs/<dev>/....

For example, in fstests we set "fail_at_unmount" specifically when
running a test that will error out all writes and then unmount.

The code that does this is in common/xfs:

# Prepare a mounted filesystem for an IO error shutdown test by disabling retry
# for metadata writes.  This prevents a (rare) log livelock when:
#
# - The log has given out all available grant space, preventing any new
#   writers from tripping over IO errors (and shutting down the fs/log),
# - All log buffers were written to disk, and
# - The log tail is pinned because the AIL keeps hitting EIO trying to write
#   committed changes back into the filesystem.
#
# Real users might want the default behavior of the AIL retrying writes forever
# but for testing purposes we don't want to wait.
#
# The sole parameter should be the filesystem data device, e.g. $SCRATCH_DEV.
_xfs_prepare_for_eio_shutdown()
{
        local dev="$1"
        local ctlfile="error/fail_at_unmount"

        # Once we enable IO errors, it's possible that a writer thread will
        # trip over EIO, cancel the transaction, and shut down the system.
        # This is expected behavior, so we need to remove the "Internal error"
        # message from the list of things that can cause the test to be marked
        # as failed.
        _add_dmesg_filter "Internal error"

        # Don't retry any writes during the (presumably) post-shutdown unmount
        _has_fs_sysfs "$ctlfile" && _set_fs_sysfs_attr $dev "$ctlfile" 1

        # Disable retry of metadata writes that fail with EIO
        for ctl in max_retries retry_timeout_seconds; do
                ctlfile="error/metadata/EIO/$ctl"

                _has_fs_sysfs "$ctlfile" && _set_fs_sysfs_attr $dev "$ctlfile" 0
        done
}

However, fail_at_unmount only covers unmount. It does not address the
same issue when a filesystem freeze is run (because freeze has to bring
the on-disk state down to the same as a clean unmounted filesystem).
Hence for syzbot, the only way to avoid this sort of issue is to cap
the maximum number of retries so that metadata writes fail as soon as
the device starts rejecting them.

Realistically, we want syzbot to exercise both the retry logic and
the hard fail logic. Right now it is only exercising the retry
logic. Setting max_retries to, say, three would exercise both paths
and still avoid all the potential "livelock until user intervention"
test hangs...
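
As a rough sketch of what that could look like on the syzbot side, run
once after each XFS mount. The helper name and the XFS_SYSFS_ROOT
override are made up for illustration (the override only exists so the
function can be tried against a scratch directory); the only thing
taken from the above is the error/metadata/EIO/max_retries path.

```shell
# Hypothetical helper, not part of fstests or syzkaller: cap the number
# of times XFS retries a metadata write that failed with EIO.
#
# $1: device name as it appears under /sys/fs/xfs (e.g. "loop0")
# $2: retry cap, defaults to 3
#
# XFS_SYSFS_ROOT is an assumed override for dry runs; it defaults to
# the real sysfs location.
xfs_cap_eio_retries()
{
        local dev="$1"
        local retries="${2:-3}"
        local root="${XFS_SYSFS_ROOT:-/sys/fs/xfs}"
        local ctlfile="$root/$dev/error/metadata/EIO/max_retries"

        # The control file only exists while the filesystem is mounted,
        # so report failure instead of creating a stray file.
        [ -w "$ctlfile" ] || return 1

        echo "$retries" > "$ctlfile"
}
```

For example, after mounting /dev/loop0, "xfs_cap_eio_retries loop0 3"
would cap EIO retries at three for that mount only. It has to be
repeated for every new mount.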

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx