On Sun, Sep 23, 2018 at 2:54 AM 张本龙 <zbl.lkml@xxxxxxxxx> wrote: > > Hi Dave, > > May I ask if the problem has been fixed? It's encountered on our > 3.10.0-514.16.1.el7.x86_64 production environment too. > > > Cannot find info other places so just wanna confirm. > > Thanks, > Benlong > Dave Chinner <david@xxxxxxxxxxxxx> 于2017年10月8日周日 上午6:55写道 This behaviour is still preserved in 4.18. > > > > On Fri, Oct 06, 2017 at 03:29:01PM +0300, Amir Goldstein wrote: > > > On Mon, Oct 2, 2017 at 1:49 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > > > On Sun, Oct 01, 2017 at 03:10:03PM -0700, Sargun Dhillon wrote: > > > >> I'm running into an issue where xfs aild is locking up. This is on > > > >> kernel version 4.9.34. It's an SMP system with 32 cores, and ~250G of > > > >> RAM (AWS R4.8XL) and an XFS filesystem with 1 SSD with project ID > > > >> quotas in use. It's the only XFS filesystem on the host. The root > > > >> partition is running EXT4, and isn't involved in this. > > > >> > > > >> There are containers that use overlayfs atop this filesystem. It looks > > > >> like one of the processes (10090, or 11504) has gotten into a state > > > >> where it's holding a lock on a xfs_buf, and they're trying to lock > > > >> xfs_buf's which are currently on the xfs ail list. > > > >> > > > ... > > > > Ok, this is a RENAME_WHITEOUT case, and that points to the issue. > > > > The whiteout inode is allocated as a temporary inode, which means > > > > it remains on the unlinked list so that if we crash part way through > > > > the update log recovery will free it again. > > > > > > > > Once all the dirent updates and other rename work is done, we remove > > > > the whiteout inode from the unlinked list, and that requires > > > > grabbing the AGI lock. That's what we are stuck on here. > > > > > > > ... > > > > > > > > Because this is the deadlock - we're trying to lock the AGF with an > > > > AGI already locked. That means the above RENAME_WHITEOUT has either > > > > allocated or freed extents in manipulating the dirents during > > > > rename, and so holds an AGF locked. It's a classic ABBA deadlock. > > > > > > > > That's the problem, not sure what the solution is yet - there's no > > > > obvious or simple way around this RENAME_WHITEOUT behaviour (which > > > > only affects overlay, fwiw). I'll have a think about it. > > > > > > > > > > Dave, > > > > > > Could you explain why the RENAME_WHITEOUT case is different locking > > > order wise from linking an O_TEMPFILE? > > > > RENAME_WHITEOUT creates an O_TMPFILE inode in the form or a whiteout > > device node in an initial separate transaction, which is fine as > > this uses the normal AGI->AGF lock order. This is so if we crash > > while doing the rename, log recovery will free the (now unused) > > whiteout device node. > > > > We then use the O_TMPFILE inode in the rename, which does all the > > directory mods, which take the AGF lock to allocate directory > > blocks. Then it removes the O_TMPFILE inode from the unlinked list > > because it's now referenced by a directory which takes the AGI lock. > > > > > > > Is it because xfs_iunlink_remove() is called before xfs_dir_createname() > > > in xfs_link()? > > > > Yes, and gives the lock order AGI->AGF. > > > > > Also, in xfs_rename(), before removing whiteout inode from unlinked list, > > > the comment says: "If we fail here after bumping the link > > > * count, we're shutting down the filesystem so we'll never see the > > > * intermediate state on disk.", > > > but I am not actually seeing where that shutdown takes place, or maybe > > > I don't know what to look for. > > > > Say, for example, the journal commit fails. That'll shut down the > > filesystem. > > > > The reason for doing the xfs_iunlink_remove() call last is that > > if we get a failure to modify a directory entry (e.g. we're at > > ENOSPC) before we've dirtied anything in the directories, we'll end > > up cancelling a dirty transaction and that will shut down the > > filesystem. On failures that should just return an error, we have > > to put the inode back on the unlinked list so we can commit the > > dirty transaction and return the error. It quickly becomes very > > nasty trying to work out all the different error paths and exactly > > which ones are fatal (i.e. trigger a shutdown) and those that need > > to (and can) put the inode back on the unlinked list successfully > > and commit the transaction so things can proceed. > > > > Ideally we need to make the unlinked list removal a deferred op > > (i.e. log an intent and then do it after the RENAME_WHITEOUT > > transaction has committed). That would leave the error handling > > simple and get rid of the lock order problem because it would move > > the unlinked list removal to it's own transaction.... > > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@xxxxxxxxxxxxx > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > More majordomo info at http://vger.kernel.org/majordomo-info.html