Thanks Darrick This patch works fine. I did some tests for about 3 days, this issue should be fixed. During the test, I did 'ps -aux|grep mv' to get the processes' status. I can see some `mv` processes come to 'D' state and then disappear. So this patch may impact the rename operation's performance of overlay and xfs. I'd like to have a test, any suggestions? For overlay, this is fine because the container's rootfs shouldn't have many rename calls. What I am most worried about is XFS. On Fri, Dec 18, 2020 at 5:11 AM Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote: > > On Tue, Dec 15, 2020 at 08:44:27PM +0800, wenli xie wrote: > > I tried upstream kernel 5.10 to do the test, and this issue still can be > > reproduced. > > Thanks for the report, I've condensed this down to the following: > > #!/bin/bash > > SCRATCH_MNT=/mnt > LOAD_FACTOR=1 > TIME_FACTOR=1 > > mkfs.xfs -f /dev/sda > mount /dev/sda $SCRATCH_MNT > > mkdir $SCRATCH_MNT/lowerdir > mkdir $SCRATCH_MNT/lowerdir1 > mkdir $SCRATCH_MNT/lowerdir/etc > mkdir $SCRATCH_MNT/workers > echo salts > $SCRATCH_MNT/lowerdir/etc/access.conf > touch $SCRATCH_MNT/running > > stop_workers() { > test -e $SCRATCH_MNT/running || return > rm -f $SCRATCH_MNT/running > > while [ "$(ls $SCRATCH_MNT/workers/ | wc -l)" -gt 0 ]; do > wait > done > } > > worker() { > local tag="$1" > local mergedir="$SCRATCH_MNT/merged$tag" > local l="lowerdir=$SCRATCH_MNT/lowerdir:$SCRATCH_MNT/lowerdir1" > local u="upperdir=$SCRATCH_MNT/upperdir$tag" > local w="workdir=$SCRATCH_MNT/workdir$tag" > local i="index=off" > > touch $SCRATCH_MNT/workers/$tag > while test -e $SCRATCH_MNT/running; do > rm -rf $SCRATCH_MNT/merged$tag > rm -rf $SCRATCH_MNT/upperdir$tag > rm -rf $SCRATCH_MNT/workdir$tag > mkdir $SCRATCH_MNT/merged$tag > mkdir $SCRATCH_MNT/workdir$tag > mkdir $SCRATCH_MNT/upperdir$tag > > mount -t overlay overlay -o "$l,$u,$w,$i" $mergedir > mv $mergedir/etc/access.conf $mergedir/etc/access.conf.bak > touch $mergedir/etc/access.conf > mv $mergedir/etc/access.conf $mergedir/etc/access.conf.bak > touch $mergedir/etc/access.conf > umount $mergedir > done > rm -f $SCRATCH_MNT/workers/$tag > } > > for i in $(seq 0 $((4 + LOAD_FACTOR)) ); do > worker $i & > done > > sleep $((30 * TIME_FACTOR)) > stop_workers > > ...and I think this is enough to diagnose the deadlock. > > This is an ABBA deadlock caused by locking the AGI buffers in the wrong > order. Specifically, we seem to be calling xfs_dir_rename with a > non-null @wip and a non-null @target_ip. In the deadlock scenario, @wip > is an inode in AG 2, and @target_ip is an inode in AG 0 with nlink==1. > > First we call xfs_iunlink_remove to remove @wip from the unlinked list, > which causes us to lock AGI 2. Next we replace the directory entry. > Finally, we need to droplink @target_ip. Since @target_ip has nlink==1, > xfs_droplink will need to put it on AGI 0's unlinked list. > > Unfortunately, the locking rules say that you can only lock AGIs in > increasing order. This means that we cannot lock AGI 0 after locking > AGI 2 without risking deadlock. > > Does the attached patch fix the deadlock for you? > > --D > > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > Subject: [PATCH] xfs: fix an ABBA deadlock in xfs_rename > > When overlayfs is running on top of xfs and the user unlinks a file in > the overlay, overlayfs will create a whiteout inode and ask xfs to > "rename" the whiteout file atop the one being unlinked. If the file > being unlinked loses its one nlink, we then have to put the inode on the > unlinked list. > > This requires us to grab the AGI buffer of the whiteout inode to take it > off the unlinked list (which is where whiteouts are created) and to grab > the AGI buffer of the file being deleted. If the whiteout was created > in a higher numbered AG than the file being deleted, we'll lock the AGIs > in the wrong order and deadlock. > > Therefore, grab all the AGI locks we think we'll need ahead of time, and > in the correct order. > > Reported-by: wenli xie <wlxie7296@xxxxxxxxx> > Fixes: 93597ae8dac0 ("xfs: Fix deadlock between AGI and AGF when target_ip exists in xfs_rename()") > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > --- > fs/xfs/xfs_inode.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 46 insertions(+) > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c > index b7352bc4c815..dd419a1bc6ba 100644 > --- a/fs/xfs/xfs_inode.c > +++ b/fs/xfs/xfs_inode.c > @@ -3000,6 +3000,48 @@ xfs_rename_alloc_whiteout( > return 0; > } > > +/* > + * For the general case of renaming files, lock all the AGI buffers we need to > + * handle bumping the nlink of the whiteout inode off the unlinked list and to > + * handle dropping the nlink of the target inode. We have to do this in > + * increasing AG order to avoid deadlocks. > + */ > +static int > +xfs_rename_lock_agis( > + struct xfs_trans *tp, > + struct xfs_inode *wip, > + struct xfs_inode *target_ip) > +{ > + struct xfs_mount *mp = tp->t_mountp; > + struct xfs_buf *bp; > + xfs_agnumber_t agi_locks[2] = { NULLAGNUMBER, NULLAGNUMBER }; > + int error; > + > + if (wip) > + agi_locks[0] = XFS_INO_TO_AGNO(mp, wip->i_ino); > + > + if (target_ip && VFS_I(target_ip)->i_nlink == 1) > + agi_locks[1] = XFS_INO_TO_AGNO(mp, target_ip->i_ino); > + > + if (agi_locks[0] != NULLAGNUMBER && agi_locks[1] != NULLAGNUMBER && > + agi_locks[0] > agi_locks[1]) > + swap(agi_locks[0], agi_locks[1]); > + > + if (agi_locks[0] != NULLAGNUMBER) { > + error = xfs_read_agi(mp, tp, agi_locks[0], &bp); > + if (error) > + return error; > + } > + > + if (agi_locks[1] != NULLAGNUMBER) { > + error = xfs_read_agi(mp, tp, agi_locks[1], &bp); > + if (error) > + return error; > + } > + > + return 0; > +} > + > /* > * xfs_rename > */ > @@ -3130,6 +3172,10 @@ xfs_rename( > } > } > > + error = xfs_rename_lock_agis(tp, wip, target_ip); > + if (error) > + return error; > + > /* > * Directory entry creation below may acquire the AGF. Remove > * the whiteout from the unlinked list first to preserve correct