Re: XFS AIL lockup

Sargun Dhillon <sargun@xxxxxxxxx> · Mon, 29 Oct 2018 17:05:22 -0700



On Sun, Sep 23, 2018 at 2:54 AM 张本龙 <zbl.lkml@xxxxxxxxx> wrote:
>
> Hi Dave,
>
> May I ask if the problem has been fixed? It's encountered on our
> 3.10.0-514.16.1.el7.x86_64 production environment too.
>
>
> Cannot find info other places so just wanna confirm.
>
> Thanks,
> Benlong
> Dave Chinner <david@xxxxxxxxxxxxx> 于2017年10月8日周日 上午6:55写道
This behaviour is still preserved in 4.18.

> >
> > On Fri, Oct 06, 2017 at 03:29:01PM +0300, Amir Goldstein wrote:
> > > On Mon, Oct 2, 2017 at 1:49 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > On Sun, Oct 01, 2017 at 03:10:03PM -0700, Sargun Dhillon wrote:
> > > >> I'm running into an issue where xfs aild is locking up. This is on
> > > >> kernel version 4.9.34. It's an SMP system with 32 cores, and ~250G of
> > > >> RAM (AWS R4.8XL) and an XFS filesystem with 1 SSD with project ID
> > > >> quotas in use. It's the only XFS filesystem on the host. The root
> > > >> partition is running EXT4, and isn't involved in this.
> > > >>
> > > >> There are containers that use overlayfs atop this filesystem. It looks
> > > >> like one of the processes (10090, or 11504) has gotten into a state
> > > >> where it's holding a lock on a xfs_buf, and they're trying to lock
> > > >> xfs_buf's which are currently on the xfs ail list.
> > > >>
> > > ...
> > > > Ok, this is a RENAME_WHITEOUT case, and that points to the issue.
> > > > The whiteout inode is allocated as a temporary inode, which means
> > > > it remains on the unlinked list so that if we crash part way through
> > > > the update log recovery will free it again.
> > > >
> > > > Once all the dirent updates and other rename work is done, we remove
> > > > the whiteout inode from the unlinked list, and that requires
> > > > grabbing the AGI lock. That's what we are stuck on here.
> > > >
> > > ...
> > > >
> > > > Because this is the deadlock - we're trying to lock the AGF with an
> > > > AGI already locked. That means the above RENAME_WHITEOUT has either
> > > > allocated or freed extents in manipulating the dirents during
> > > > rename, and so holds an AGF locked. It's a classic ABBA deadlock.
> > > >
> > > > That's the problem, not sure what the solution is yet - there's no
> > > > obvious or simple way around this RENAME_WHITEOUT behaviour (which
> > > > only affects overlay, fwiw). I'll have a think about it.
> > > >
> > >
> > > Dave,
> > >
> > > Could you explain why the RENAME_WHITEOUT case is different locking
> > > order wise from linking an O_TEMPFILE?
> >
> > RENAME_WHITEOUT creates an O_TMPFILE inode in the form or a whiteout
> > device node in an initial separate transaction, which is fine as
> > this uses the normal AGI->AGF lock order. This is so if we crash
> > while doing the rename, log recovery will free the (now unused)
> > whiteout device node.
> >
> > We then use the O_TMPFILE inode in the rename, which does all the
> > directory mods, which take the AGF lock to allocate directory
> > blocks. Then it removes the O_TMPFILE inode from the unlinked list
> > because it's now referenced by a directory which takes the AGI lock.
> >
> >
> > > Is it because xfs_iunlink_remove() is called before xfs_dir_createname()
> > > in xfs_link()?
> >
> > Yes, and gives the lock order AGI->AGF.
> >
> > > Also, in xfs_rename(), before removing whiteout inode from unlinked list,
> > > the comment says: "If we fail here after bumping the link
> > >          * count, we're shutting down the filesystem so we'll never see the
> > >          * intermediate state on disk.",
> > > but I am not actually seeing where that shutdown takes place, or maybe
> > > I don't know what to look for.
> >
> > Say, for example, the journal commit fails. That'll shut down the
> > filesystem.
> >
> > The reason for doing the xfs_iunlink_remove() call last is that
> > if we get a failure to modify a directory entry (e.g. we're at
> > ENOSPC) before we've dirtied anything in the directories, we'll end
> > up cancelling a dirty transaction and that will shut down the
> > filesystem. On failures that should just return an error, we have
> > to put the inode back on the unlinked list so we can commit the
> > dirty transaction and return the error. It quickly becomes very
> > nasty trying to work out all the different error paths and exactly
> > which ones are fatal (i.e. trigger a shutdown) and those that need
> > to (and can) put the inode back on the unlinked list successfully
> > and commit the transaction so things can proceed.
> >
> > Ideally we need to make the unlinked list removal a deferred op
> > (i.e. log an intent and then do it after the RENAME_WHITEOUT
> > transaction has committed). That would leave the error handling
> > simple and get rid of the lock order problem because it would move
> > the unlinked list removal to it's own transaction....
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@xxxxxxxxxxxxx
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html