ping? On Tue, Aug 24, 2010 at 11:45:33PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@xxxxxxxxxx> > > Recently we've had WANT_CORRUPTED_GOTO filesystem shutdowns reported > on filesystems with large numbers of small AGs. RedHat QA found a > simple test case at: > > https://bugzilla.redhat.com/show_bug.cgi?id=626244 > > Which was further refined to: > > # mkfs.xfs -f -d agsize=16m,size=50g <dev> > # mount <dev> /mnt > # xfs_io -f -c 'resvsp 0 40G' /mnt/foo > > The initial analysis is in the above bug. The fundamental problem is > that the data extent allocation is using all the free blocks in the > allocation group, and then the bmap btree block allocation is > dipping into the reserved block pool that each AG holds for > freespace btree manipulations. This results in failures when > multiple bmap btree blocks are needed for a multilevel split of the > bmap btree. > > The reason this is occurring is that when we do a btree block > allocation after a data extent allocation we run down the path that > does not set up the minleft allocation parameter. That is, we allow > the btree block allocation to use up all the blocks in the current > AG if that is what will make the current allocation succeed. This > what we are supposed to only allow the low space allocation > algorithm to do, not a normal allocation. The result is that we can > allocate the first block from the AG freelist, and then the second > block allocation in the split will fail in the same AG because we do > not have a minleft parameter set and hence will not pass the checks > in xfs_alloc_fix_freelist() and hence the allocation will fail. > Further, because no minleft parameter is set, the btree block > allocation will not retry the allocation with different parameters > and (potentially) enable the low space algorithm. > > The result is that we fail a btree block allocation and hence fail > the extent allocation with a dirty btree and transaction. The dirty > btree causes the WANT_CORRUPTED_GOTO warning, and cancelling the > dirty transaction triggers the shutdown. > > The fix appears to be to ensure that we set the minleft parameter > correctly to reflect the potential number of btree blocks we still > need to allocate from the same AG if we are doing a worst case > split. By doing so, the particular case results in the first btree > block allocation setting minleft to the number of blocks needed for > a split. Hence the AG that we just allocated all the free blocks out > of for the data extent will fail because there are not enough free > blocks for all the blocks in the split in the AG. It will fail this > without dirtying anything, and because minleft is now > 0, will > trigger the retry algorithm. > > The fallback algorithm also needs a slight modification. It > currently assumes that if minleft is set, no allocation has been > done yet, so it can scan from AG 0 to find a free block. If it is > left like this, it can trigger deadlocks from the new case as we > have a prior allocation and hence cannot allocate from an AG less > than the current AG. Hence it is modified to use a START_AG > allocation to scan upwards from the current AG, hence avoiding the > known AG locking deadlocks. > > As far as I can tell, the bmap btree code has never handled this > case properly - I checked as far back as 1995 and minleft has never > been set to avoid selecting an AG that cannot be allocated out of... > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > --- > fs/xfs/xfs_bmap_btree.c | 51 ++++++++++++++++++++++++++++++++++++++-------- > 1 files changed, 42 insertions(+), 9 deletions(-) > > diff --git a/fs/xfs/xfs_bmap_btree.c b/fs/xfs/xfs_bmap_btree.c > index 87d3c10..d5ef4e3 100644 > --- a/fs/xfs/xfs_bmap_btree.c > +++ b/fs/xfs/xfs_bmap_btree.c > @@ -538,6 +538,25 @@ xfs_bmbt_alloc_block( > args.type = XFS_ALLOCTYPE_START_BNO; > } else { > args.type = XFS_ALLOCTYPE_NEAR_BNO; > + > + /* > + * we've come in here because this is the second or subsequent > + * btree block we need to allocate for the bmap btree > + * modification. If we've just emptied the AG and there are > + * only free list blocks left, we need to make sure that we > + * take into account the minleft value that was reserved on the > + * first allocation through here (the NULLFSBLOCK branch > + * above). In that case, minleft will already take into account > + * the maximum number of blocks needed for a btree split, and > + * the number of blocks already allocated is recorded in the > + * cursor. From that, we can work out exactly how much smaller > + * the minleft should be so that we don't select an AG that > + * does not have enough blocks available to continue the entire > + * btree split. > + */ > + args.minleft = XFS_BM_MAXLEVELS(args.mp, > + (int)cur->bc_private.b.whichfork) - 1 - > + cur->bc_private.b.allocated; > } > > args.minlen = args.maxlen = args.prod = 1; > @@ -550,15 +569,29 @@ xfs_bmbt_alloc_block( > if (error) > goto error0; > > - if (args.fsbno == NULLFSBLOCK && args.minleft) { > - /* > - * Could not find an AG with enough free space to satisfy > - * a full btree split. Try again without minleft and if > - * successful activate the lowspace algorithm. > - */ > - args.fsbno = 0; > - args.type = XFS_ALLOCTYPE_FIRST_AG; > - args.minleft = 0; > + while (args.fsbno == NULLFSBLOCK && args.minleft) { > + if (cur->bc_private.b.firstblock == NULLFSBLOCK) { > + /* > + * Could not find an AG with enough free space to satisfy > + * a full btree split. Try again without minleft and if > + * successful activate the lowspace algorithm. > + */ > + args.type = XFS_ALLOCTYPE_FIRST_AG; > + args.fsbno = 0; > + args.minleft = 0; > + } else { > + /* > + * Failed to find enough space for a btree block after > + * a extent allocation has already occurred. Continue > + * searching other AGs that can hold the remaining > + * blocks. If we fail with minleft set, then clear it > + * and try again. > + */ > + args.type = XFS_ALLOCTYPE_START_AG; > + args.fsbno = cur->bc_private.b.firstblock; > + if (cur->bc_private.b.flist->xbf_low) > + args.minleft = 0; > + } > error = xfs_alloc_vextent(&args); > if (error) > goto error0; > -- > 1.7.1 > > _______________________________________________ > xfs mailing list > xfs@xxxxxxxxxxx > http://oss.sgi.com/mailman/listinfo/xfs > -- Dave Chinner david@xxxxxxxxxxxxx _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs