Re: [PATCH] [RFC] xfs: prevent bmap btree from using alloc btree reserved blocks

Lachlan McIlroy <lmcilroy@xxxxxxxxxx> · Mon, 20 Sep 2010 01:25:49 -0400 (EDT)

Looks good, some questions inline.

----- "Dave Chinner" <david@xxxxxxxxxxxxx> wrote:

> ping?
> 
> On Tue, Aug 24, 2010 at 11:45:33PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > 
> > Recently we've had WANT_CORRUPTED_GOTO filesystem shutdowns
> reported
> > on filesystems with large numbers of small AGs. RedHat QA found a
> > simple test case at:
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=626244
> > 
> > Which was further refined to:
> > 
> > # mkfs.xfs -f -d agsize=16m,size=50g <dev>
> > # mount <dev> /mnt
> > # xfs_io -f -c 'resvsp 0 40G' /mnt/foo
> > 
> > The initial analysis is in the above bug. The fundamental problem
> is
> > that the data extent allocation is using all the free blocks in the
> > allocation group, and then the bmap btree block allocation is
> > dipping into the reserved block pool that each AG holds for
> > freespace btree manipulations. This results in failures when
> > multiple bmap btree blocks are needed for a multilevel split of the
> > bmap btree.
> > 
> > The reason this is occurring is that when we do a btree block
> > allocation after a data extent allocation we run down the path that
> > does not set up the minleft allocation parameter. That is, we allow
> > the btree block allocation to use up all the blocks in the current
> > AG if that is what will make the current allocation succeed. This
> > what we are supposed to only allow the low space allocation
> > algorithm to do, not a normal allocation. The result is that we can
> > allocate the first block from the AG freelist, and then the second
> > block allocation in the split will fail in the same AG because we
> do
> > not have a minleft parameter set and hence will not pass the checks
> > in xfs_alloc_fix_freelist() and hence the allocation will fail.
> > Further, because no minleft parameter is set, the btree block
> > allocation will not retry the allocation with different parameters
> > and (potentially) enable the low space algorithm.

I think the assumption here is that if the first btree block (with
minleft set) succeeds then all the required free blocks for further
btree allocations will be available if needed and allocations shouldn't
fail.  But this clearly isn't holding true.  Do we have multiple threads
allocating from the same AG stealing each other's minleft blocks?

> > 
> > The result is that we fail a btree block allocation and hence fail
> > the extent allocation with a dirty btree and transaction. The dirty
> > btree causes the WANT_CORRUPTED_GOTO warning, and cancelling the
> > dirty transaction triggers the shutdown.
> > 
> > The fix appears to be to ensure that we set the minleft parameter
> > correctly to reflect the potential number of btree blocks we still
> > need to allocate from the same AG if we are doing a worst case
> > split. By doing so, the particular case results in the first btree
> > block allocation setting minleft to the number of blocks needed for
> > a split. Hence the AG that we just allocated all the free blocks
> out
> > of for the data extent will fail because there are not enough free
> > blocks for all the blocks in the split in the AG. It will fail this
> > without dirtying anything, and because minleft is now > 0, will
> > trigger the retry algorithm.
> > 
> > The fallback algorithm also needs a slight modification. It
> > currently assumes that if minleft is set, no allocation has been
> > done yet, so it can scan from AG 0 to find a free block. If it is
> > left like this, it can trigger deadlocks from the new case as we
> > have a prior allocation and hence cannot allocate from an AG less
> > than the current AG. Hence it is modified to use a START_AG
> > allocation to scan upwards from the current AG, hence avoiding the
> > known AG locking deadlocks.
> > 
> > As far as I can tell, the bmap btree code has never handled this
> > case properly - I checked as far back as 1995 and minleft has never
> > been set to avoid selecting an AG that cannot be allocated out
> of...

I'm not surprised - the low space algorithm was broken about that time
too.

> > 
> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > ---
> >  fs/xfs/xfs_bmap_btree.c |   51
> ++++++++++++++++++++++++++++++++++++++--------
> >  1 files changed, 42 insertions(+), 9 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_bmap_btree.c b/fs/xfs/xfs_bmap_btree.c
> > index 87d3c10..d5ef4e3 100644
> > --- a/fs/xfs/xfs_bmap_btree.c
> > +++ b/fs/xfs/xfs_bmap_btree.c
> > @@ -538,6 +538,25 @@ xfs_bmbt_alloc_block(
> >  		args.type = XFS_ALLOCTYPE_START_BNO;
> >  	} else {
> >  		args.type = XFS_ALLOCTYPE_NEAR_BNO;

Could we use XFS_ALLOCTYPE_START_BNO here so that it automatically tries other
AGs instead of doing it manually (like you've done below)?  It should even
restart from AG 0 if no other allocations have been done.

> > +
> > +		/*
> > +		 * we've come in here because this is the second or subsequent
> > +		 * btree block we need to allocate for the bmap btree
> > +		 * modification. If we've just emptied the AG and there are
> > +		 * only free list blocks left, we need to make sure that we
> > +		 * take into account the minleft value that was reserved on the
> > +		 * first allocation through here (the NULLFSBLOCK branch
> > +		 * above). In that case, minleft will already take into account
> > +		 * the maximum number of blocks needed for a btree split, and
> > +		 * the number of blocks already allocated is recorded in the
> > +		 * cursor. From that, we can work out exactly how much smaller
> > +		 * the minleft should be so that we don't select an AG that
> > +		 * does not have enough blocks available to continue the entire
> > +		 * btree split.
> > +		 */
> > +		args.minleft = XFS_BM_MAXLEVELS(args.mp,
> > +					(int)cur->bc_private.b.whichfork) - 1 -
> > +				cur->bc_private.b.allocated;
> >  	}
> >  
> >  	args.minlen = args.maxlen = args.prod = 1;
> > @@ -550,15 +569,29 @@ xfs_bmbt_alloc_block(
> >  	if (error)
> >  		goto error0;
> >  
> > -	if (args.fsbno == NULLFSBLOCK && args.minleft) {
> > -		/*
> > -		 * Could not find an AG with enough free space to satisfy
> > -		 * a full btree split.  Try again without minleft and if
> > -		 * successful activate the lowspace algorithm.
> > -		 */
> > -		args.fsbno = 0;
> > -		args.type = XFS_ALLOCTYPE_FIRST_AG;
> > -		args.minleft = 0;
> > +	while (args.fsbno == NULLFSBLOCK && args.minleft) {
> > +		if (cur->bc_private.b.firstblock == NULLFSBLOCK) {

Makes sense, need to check b_firstblock since minleft is always set now.
Do we still need the check for minleft here?  The only case I can see that
minleft would be 0 now is for the low space algorithm and there may be some
benefit it letting it try again.

> > +			/*
> > +			 * Could not find an AG with enough free space to satisfy
> > +			 * a full btree split.  Try again without minleft and if
> > +			 * successful activate the lowspace algorithm.
> > +			 */
> > +			args.type = XFS_ALLOCTYPE_FIRST_AG;
> > +			args.fsbno = 0;
> > +			args.minleft = 0;
> > +		} else {

Nice one, allow the allocator to hunt for btree blocks in later AGs.

> > +			/*
> > +			 * Failed to find enough space for a btree block after
> > +			 * a extent allocation has already occurred. Continue
> > +			 * searching other AGs that can hold the remaining
> > +			 * blocks. If we fail with minleft set, then clear it
> > +			 * and try again.
> > +			 */
> > +			args.type = XFS_ALLOCTYPE_START_AG;
> > +			args.fsbno = cur->bc_private.b.firstblock;
> > +			if (cur->bc_private.b.flist->xbf_low)

I don't think xbf_low can be set here - if it was set then minleft
would be 0 and we wouldn't have reached here.

> > +				args.minleft = 0;
> > +		}
> >  		error = xfs_alloc_vextent(&args);
> >  		if (error)
> >  			goto error0;
> > -- 
> > 1.7.1
> > 
> > _______________________________________________
> > xfs mailing list
> > xfs@xxxxxxxxxxx
> > http://oss.sgi.com/mailman/listinfo/xfs
> > 
> 
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
> 
> _______________________________________________
> xfs mailing list
> xfs@xxxxxxxxxxx
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs