On Tue, Aug 13, 2019 at 09:36:14AM -0400, Brian Foster wrote: > On Tue, Aug 13, 2019 at 07:17:33PM +0800, kaixuxia wrote: > > When performing rename operation with RENAME_WHITEOUT flag, we will > > hold AGF lock to allocate or free extents in manipulating the dirents > > firstly, and then doing the xfs_iunlink_remove() call last to hold > > AGI lock to modify the tmpfile info, so we the lock order AGI->AGF. > > > > IIUC, the whiteout use case is that we're renaming a file, but the > source dentry must be replaced with a magic whiteout inode rather than > be removed. Therefore, xfs_rename() allocates the whiteout inode as a > tmpfile first in a separate transaction, updates the target dentry with > the source inode, replaces the source dentry to point to the whiteout > inode and finally removes the whiteout inode from the unlinked list > (since it is a tmpfile). This leads to the problem described below > because the rename transaction ends up doing directory block allocs > (locking the AGF) followed by the unlinked list remove (locking the > AGI). > > My understanding from reading the code is that this is primarly to > cleanly handle error scenarios. If anything fails after we've allocated > the whiteout tmpfile, it's simply left on the unlinked list and so the > filesystem remains in a consistent/recoverable state. Given that, the > solution here seems like overkill to me. For one, I thought background > unlinked list removal was already on our roadmap (Darrick might have > been looking at that and may already have a prototype as well). Also, > unlinked list removal occurs at log recovery time already. That's > somewhat of an existing purpose of the list, which makes a deferred > unlinked list removal operation superfluous in more traditional cases > where unlinked list removal doesn't require consistency with a directory > operation. > > Functional discussion aside.. from a complexity standpoint I'm wondering > if we could do something much more simple like acquire the AGI lock for > a whiteout inode earlier in xfs_rename(). For example, suppose we did > something like: > > /* > * Acquire the whiteout agi to preserve locking order in anticipation of > * unlinked list removal. > */ > if (wip) > xfs_read_agi(mp, tp, XFS_INO_TO_AGNO(mp, wip->i_ino), &agibp); > > ... after we allocate the transaction but before we do any directory ops > that can result in block allocations. Would that prevent the problem > you've observed? I'd prefer that we just do things in an order that doesn't invert the locking. For a whiteout, we only allocate blocks when modifying the target directory, and we do a check to see if that will succeed before actually doing the directory modification. That means the directory modification will only fail due to an IO error or corruption, both of which have a high probability of causing the filesystem to be shut down. Any error after the directory mod will cause a shutdown because the transaction is dirty. Further, the operation that will lock the AGF is the target directory modification if blocks need to be allocated, and the whole point of the "check before execution" is to abort if ENOSPC would occur as a result of trying to allocate blocks and we don't have a space reservation for for those blocks because we are very, very close to ENOSPC already. If we fail the xfs_iunlink_remove() operation, we're shutting down the filesystem. If we fail the xfs_dir_createname(target) call, we are most likely going to be shutting down the filesystem. So the premise that locating xfs_iunlink_remove() at the end to make error handling easy is not really true - transaction cancel will clean both of them up and shut the filesystem down. Hence I think the right thing to do is to move the xfs_iunlink_remove() call to between xfs_dir_canenter() and xfs_dir_createname(). This means ENOSPC will abort with a clean transaction and all is good, otherwise a failure is most likely going to shut down the filesystem and it doesn't matter if we do xfs_iunlink_remove() or xfs_dir_createname() first. And by doing xfs_iunlink_remove() first, we remove the AGI/AGF lock inversion problem.... I think this holds together, but I might have missed something in the tangle of different rename operation cases. So it's worth checking, but it looks to me like a better solution than having a bare AGI lock in the middle of the function to work around error handling logic we didn't clearly enough about at the time (hindsight and all that jazz).... Thoughts? -Dave. -- Dave Chinner david@xxxxxxxxxxxxx