Re: [Bug report] overlayfs over xfs whiteout operation may cause deadlock

wenli xie <wlxie7296@xxxxxxxxx> · Tue, 22 Dec 2020 09:29:30 +0800

Thanks Darrick
This patch works fine.  I did some tests for about 3 days, this issue
should be fixed.

During the test, I did 'ps -aux|grep mv'  to get the processes' status.
I can see  some `mv` processes come to 'D' state and then disappear.
So this patch may impact the rename operation's performance of overlay and xfs.
I'd like to have a test, any suggestions?
For overlay, this is fine because the container's rootfs shouldn't
have many rename calls.
What I am most worried about is XFS.

On Fri, Dec 18, 2020 at 5:11 AM Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote:
>
> On Tue, Dec 15, 2020 at 08:44:27PM +0800, wenli xie wrote:
> > I tried upstream kernel 5.10 to do the test, and this issue still  can be
> > reproduced.
>
> Thanks for the report, I've condensed this down to the following:
>
> #!/bin/bash
>
> SCRATCH_MNT=/mnt
> LOAD_FACTOR=1
> TIME_FACTOR=1
>
> mkfs.xfs -f /dev/sda
> mount /dev/sda $SCRATCH_MNT
>
> mkdir $SCRATCH_MNT/lowerdir
> mkdir $SCRATCH_MNT/lowerdir1
> mkdir $SCRATCH_MNT/lowerdir/etc
> mkdir $SCRATCH_MNT/workers
> echo salts > $SCRATCH_MNT/lowerdir/etc/access.conf
> touch $SCRATCH_MNT/running
>
> stop_workers() {
>         test -e $SCRATCH_MNT/running || return
>         rm -f $SCRATCH_MNT/running
>
>         while [ "$(ls $SCRATCH_MNT/workers/ | wc -l)" -gt 0 ]; do
>                 wait
>         done
> }
>
> worker() {
>         local tag="$1"
>         local mergedir="$SCRATCH_MNT/merged$tag"
>         local l="lowerdir=$SCRATCH_MNT/lowerdir:$SCRATCH_MNT/lowerdir1"
>         local u="upperdir=$SCRATCH_MNT/upperdir$tag"
>         local w="workdir=$SCRATCH_MNT/workdir$tag"
>         local i="index=off"
>
>         touch $SCRATCH_MNT/workers/$tag
>         while test -e $SCRATCH_MNT/running; do
>                 rm -rf $SCRATCH_MNT/merged$tag
>                 rm -rf $SCRATCH_MNT/upperdir$tag
>                 rm -rf $SCRATCH_MNT/workdir$tag
>                 mkdir $SCRATCH_MNT/merged$tag
>                 mkdir $SCRATCH_MNT/workdir$tag
>                 mkdir $SCRATCH_MNT/upperdir$tag
>
>                 mount -t overlay overlay -o "$l,$u,$w,$i" $mergedir
>                 mv $mergedir/etc/access.conf $mergedir/etc/access.conf.bak
>                 touch $mergedir/etc/access.conf
>                 mv $mergedir/etc/access.conf $mergedir/etc/access.conf.bak
>                 touch $mergedir/etc/access.conf
>                 umount $mergedir
>         done
>         rm -f $SCRATCH_MNT/workers/$tag
> }
>
> for i in $(seq 0 $((4 + LOAD_FACTOR)) ); do
>         worker $i &
> done
>
> sleep $((30 * TIME_FACTOR))
> stop_workers
>
> ...and I think this is enough to diagnose the deadlock.
>
> This is an ABBA deadlock caused by locking the AGI buffers in the wrong
> order.  Specifically, we seem to be calling xfs_dir_rename with a
> non-null @wip and a non-null @target_ip.  In the deadlock scenario, @wip
> is an inode in AG 2, and @target_ip is an inode in AG 0 with nlink==1.
>
> First we call xfs_iunlink_remove to remove @wip from the unlinked list,
> which causes us to lock AGI 2.  Next we replace the directory entry.
> Finally, we need to droplink @target_ip.  Since @target_ip has nlink==1,
> xfs_droplink will need to put it on AGI 0's unlinked list.
>
> Unfortunately, the locking rules say that you can only lock AGIs in
> increasing order.  This means that we cannot lock AGI 0 after locking
> AGI 2 without risking deadlock.
>
> Does the attached patch fix the deadlock for you?
>
> --D
>
> From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> Subject: [PATCH] xfs: fix an ABBA deadlock in xfs_rename
>
> When overlayfs is running on top of xfs and the user unlinks a file in
> the overlay, overlayfs will create a whiteout inode and ask xfs to
> "rename" the whiteout file atop the one being unlinked.  If the file
> being unlinked loses its one nlink, we then have to put the inode on the
> unlinked list.
>
> This requires us to grab the AGI buffer of the whiteout inode to take it
> off the unlinked list (which is where whiteouts are created) and to grab
> the AGI buffer of the file being deleted.  If the whiteout was created
> in a higher numbered AG than the file being deleted, we'll lock the AGIs
> in the wrong order and deadlock.
>
> Therefore, grab all the AGI locks we think we'll need ahead of time, and
> in the correct order.
>
> Reported-by: wenli xie <wlxie7296@xxxxxxxxx>
> Fixes: 93597ae8dac0 ("xfs: Fix deadlock between AGI and AGF when target_ip exists in xfs_rename()")
> Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> ---
>  fs/xfs/xfs_inode.c |   46 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 46 insertions(+)
>
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index b7352bc4c815..dd419a1bc6ba 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3000,6 +3000,48 @@ xfs_rename_alloc_whiteout(
>         return 0;
>  }
>
> +/*
> + * For the general case of renaming files, lock all the AGI buffers we need to
> + * handle bumping the nlink of the whiteout inode off the unlinked list and to
> + * handle dropping the nlink of the target inode.  We have to do this in
> + * increasing AG order to avoid deadlocks.
> + */
> +static int
> +xfs_rename_lock_agis(
> +       struct xfs_trans        *tp,
> +       struct xfs_inode        *wip,
> +       struct xfs_inode        *target_ip)
> +{
> +       struct xfs_mount        *mp = tp->t_mountp;
> +       struct xfs_buf          *bp;
> +       xfs_agnumber_t          agi_locks[2] = { NULLAGNUMBER, NULLAGNUMBER };
> +       int                     error;
> +
> +       if (wip)
> +               agi_locks[0] = XFS_INO_TO_AGNO(mp, wip->i_ino);
> +
> +       if (target_ip && VFS_I(target_ip)->i_nlink == 1)
> +               agi_locks[1] = XFS_INO_TO_AGNO(mp, target_ip->i_ino);
> +
> +       if (agi_locks[0] != NULLAGNUMBER && agi_locks[1] != NULLAGNUMBER &&
> +           agi_locks[0] > agi_locks[1])
> +               swap(agi_locks[0], agi_locks[1]);
> +
> +       if (agi_locks[0] != NULLAGNUMBER) {
> +               error = xfs_read_agi(mp, tp, agi_locks[0], &bp);
> +               if (error)
> +                       return error;
> +       }
> +
> +       if (agi_locks[1] != NULLAGNUMBER) {
> +               error = xfs_read_agi(mp, tp, agi_locks[1], &bp);
> +               if (error)
> +                       return error;
> +       }
> +
> +       return 0;
> +}
> +
>  /*
>   * xfs_rename
>   */
> @@ -3130,6 +3172,10 @@ xfs_rename(
>                 }
>         }
>
> +       error = xfs_rename_lock_agis(tp, wip, target_ip);
> +       if (error)
> +               return error;
> +
>         /*
>          * Directory entry creation below may acquire the AGF. Remove
>          * the whiteout from the unlinked list first to preserve correct