[PATCH] [RFC] xfs: prevent bmap btree from using alloc btree reserved blocks

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 24 Aug 2010 23:45:33 +1000

From: Dave Chinner <dchinner@xxxxxxxxxx>

Recently we've had WANT_CORRUPTED_GOTO filesystem shutdowns reported
on filesystems with large numbers of small AGs. RedHat QA found a
simple test case at:

https://bugzilla.redhat.com/show_bug.cgi?id=626244

Which was further refined to:

# mkfs.xfs -f -d agsize=16m,size=50g <dev>
# mount <dev> /mnt
# xfs_io -f -c 'resvsp 0 40G' /mnt/foo

The initial analysis is in the above bug. The fundamental problem is
that the data extent allocation is using all the free blocks in the
allocation group, and then the bmap btree block allocation is
dipping into the reserved block pool that each AG holds for
freespace btree manipulations. This results in failures when
multiple bmap btree blocks are needed for a multilevel split of the
bmap btree.

The reason this is occurring is that when we do a btree block
allocation after a data extent allocation we run down the path that
does not set up the minleft allocation parameter. That is, we allow
the btree block allocation to use up all the blocks in the current
AG if that is what will make the current allocation succeed. This
what we are supposed to only allow the low space allocation
algorithm to do, not a normal allocation. The result is that we can
allocate the first block from the AG freelist, and then the second
block allocation in the split will fail in the same AG because we do
not have a minleft parameter set and hence will not pass the checks
in xfs_alloc_fix_freelist() and hence the allocation will fail.
Further, because no minleft parameter is set, the btree block
allocation will not retry the allocation with different parameters
and (potentially) enable the low space algorithm.

The result is that we fail a btree block allocation and hence fail
the extent allocation with a dirty btree and transaction. The dirty
btree causes the WANT_CORRUPTED_GOTO warning, and cancelling the
dirty transaction triggers the shutdown.

The fix appears to be to ensure that we set the minleft parameter
correctly to reflect the potential number of btree blocks we still
need to allocate from the same AG if we are doing a worst case
split. By doing so, the particular case results in the first btree
block allocation setting minleft to the number of blocks needed for
a split. Hence the AG that we just allocated all the free blocks out
of for the data extent will fail because there are not enough free
blocks for all the blocks in the split in the AG. It will fail this
without dirtying anything, and because minleft is now > 0, will
trigger the retry algorithm.

The fallback algorithm also needs a slight modification. It
currently assumes that if minleft is set, no allocation has been
done yet, so it can scan from AG 0 to find a free block. If it is
left like this, it can trigger deadlocks from the new case as we
have a prior allocation and hence cannot allocate from an AG less
than the current AG. Hence it is modified to use a START_AG
allocation to scan upwards from the current AG, hence avoiding the
known AG locking deadlocks.

As far as I can tell, the bmap btree code has never handled this
case properly - I checked as far back as 1995 and minleft has never
been set to avoid selecting an AG that cannot be allocated out of...

Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
---
 fs/xfs/xfs_bmap_btree.c |   51 ++++++++++++++++++++++++++++++++++++++--------
 1 files changed, 42 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_bmap_btree.c b/fs/xfs/xfs_bmap_btree.c
index 87d3c10..d5ef4e3 100644
--- a/fs/xfs/xfs_bmap_btree.c
+++ b/fs/xfs/xfs_bmap_btree.c
@@ -538,6 +538,25 @@ xfs_bmbt_alloc_block(
 		args.type = XFS_ALLOCTYPE_START_BNO;
 	} else {
 		args.type = XFS_ALLOCTYPE_NEAR_BNO;
+
+		/*
+		 * we've come in here because this is the second or subsequent
+		 * btree block we need to allocate for the bmap btree
+		 * modification. If we've just emptied the AG and there are
+		 * only free list blocks left, we need to make sure that we
+		 * take into account the minleft value that was reserved on the
+		 * first allocation through here (the NULLFSBLOCK branch
+		 * above). In that case, minleft will already take into account
+		 * the maximum number of blocks needed for a btree split, and
+		 * the number of blocks already allocated is recorded in the
+		 * cursor. From that, we can work out exactly how much smaller
+		 * the minleft should be so that we don't select an AG that
+		 * does not have enough blocks available to continue the entire
+		 * btree split.
+		 */
+		args.minleft = XFS_BM_MAXLEVELS(args.mp,
+					(int)cur->bc_private.b.whichfork) - 1 -
+				cur->bc_private.b.allocated;
 	}
 
 	args.minlen = args.maxlen = args.prod = 1;
@@ -550,15 +569,29 @@ xfs_bmbt_alloc_block(
 	if (error)
 		goto error0;
 
-	if (args.fsbno == NULLFSBLOCK && args.minleft) {
-		/*
-		 * Could not find an AG with enough free space to satisfy
-		 * a full btree split.  Try again without minleft and if
-		 * successful activate the lowspace algorithm.
-		 */
-		args.fsbno = 0;
-		args.type = XFS_ALLOCTYPE_FIRST_AG;
-		args.minleft = 0;
+	while (args.fsbno == NULLFSBLOCK && args.minleft) {
+		if (cur->bc_private.b.firstblock == NULLFSBLOCK) {
+			/*
+			 * Could not find an AG with enough free space to satisfy
+			 * a full btree split.  Try again without minleft and if
+			 * successful activate the lowspace algorithm.
+			 */
+			args.type = XFS_ALLOCTYPE_FIRST_AG;
+			args.fsbno = 0;
+			args.minleft = 0;
+		} else {
+			/*
+			 * Failed to find enough space for a btree block after
+			 * a extent allocation has already occurred. Continue
+			 * searching other AGs that can hold the remaining
+			 * blocks. If we fail with minleft set, then clear it
+			 * and try again.
+			 */
+			args.type = XFS_ALLOCTYPE_START_AG;
+			args.fsbno = cur->bc_private.b.firstblock;
+			if (cur->bc_private.b.flist->xbf_low)
+				args.minleft = 0;
+		}
 		error = xfs_alloc_vextent(&args);
 		if (error)
 			goto error0;
-- 
1.7.1

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs