On Mon, Apr 11, 2011 at 08:55:33PM -0400, Lachlan McIlroy wrote: > ----- Original Message ----- > > From: Dave Chinner <dchinner@xxxxxxxxxx> > > > > When a multilevel bmbt split occurs, we can be asked to allocate > > blocks from an AG that has no space left available. In the case of > > an extent just being allocated, the first bmbt block allocation sees > > the firstblock parameter is set and does not set a minleft parameter > > for the allocation. The allocation also does not set the total > > number of blocks required by the allocation, either. > > > > If the extent allocation used all the free space in the AG, the lack > > of a total allocation size results in a block being reserved for the > > AG freespace btree manipulations being used incorrectly. > > > > Secondly, the allocation is only allowed to be allocated in the > > current AG (NEAR_BNO) because the low space algorithm has not been > > activated. As a result of this, the second block allocation in the > > split fails to find sufficient space in the current AG and fails, > > resulting in a dirty transaction cancellation and a filesytem > > shutdown. > > > > There are two problems here - both pointed out by Lachlan McIlroy > > when we were first discussing a proposed fix for the problem (See > > http://oss.sgi.com/archives/xfs/2010-09/msg00438.html). Firstly, > > the missing args->total was causing the initial block to be > > allocated from the freespace reserve. Secondly that the NEAR_BNO > > allocation should probably be a START_BNO allocation to allow blocks > > to be taken from a higher numbered AG. > > > > With these changes, the allocation failure goes away, but we trigger > > asserts in xfs_bmapi() that try to ensure that bmbt block > > allocations only occur in the same AG as the extent allocated, > > except if the low space algorithm is active. In this case, we don't > > need to activate the low space algorithm - if a START_BNO allocation > > fails with minleft = 0, then there is no space we can allocate at > > all, and hence the low space algorithm is no help. Therefore the > > asserts need modification reflect the new state of affairs. > > > > This commit fixes the problem demonstrated by the test case in > > xfstests 250. > > > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > > --- > > fs/xfs/xfs_bmap.c | 8 ++------ > > fs/xfs/xfs_bmap_btree.c | 7 ++----- > > 2 files changed, 4 insertions(+), 11 deletions(-) > > > > diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c > > index fa00788..50eed97 100644 > > --- a/fs/xfs/xfs_bmap.c > > +++ b/fs/xfs/xfs_bmap.c > > @@ -4917,13 +4917,9 @@ error0: > > if (cur) { > > if (!error) { > > ASSERT(*firstblock == NULLFSBLOCK || > > - XFS_FSB_TO_AGNO(mp, *firstblock) == > > + XFS_FSB_TO_AGNO(mp, *firstblock) <= > > XFS_FSB_TO_AGNO(mp, > > - cur->bc_private.b.firstblock) || > > - (flist->xbf_low && > > - XFS_FSB_TO_AGNO(mp, *firstblock) < > > - XFS_FSB_TO_AGNO(mp, > > - cur->bc_private.b.firstblock))); > > + cur->bc_private.b.firstblock)); > > *firstblock = cur->bc_private.b.firstblock; > > } > > xfs_btree_del_cursor(cur, > > There's another ASSERT like this earlier in xfs_bmapi() after the call > to xfs_bmap_alloc() which may need the same attention. No, it checks the allocated extent against any existing firstblock and has nothing to do with the the bmbt allocations during extent insertion. > > > diff --git a/fs/xfs/xfs_bmap_btree.c b/fs/xfs/xfs_bmap_btree.c > > index 87d3c10..b2cdbfe 100644 > > --- a/fs/xfs/xfs_bmap_btree.c > > +++ b/fs/xfs/xfs_bmap_btree.c > > @@ -518,10 +518,11 @@ xfs_bmbt_alloc_block( > > args.mp = cur->bc_mp; > > args.fsbno = cur->bc_private.b.firstblock; > > args.firstblock = args.fsbno; > > + args.total = 1; > > + args.type = XFS_ALLOCTYPE_START_BNO; > > > > if (args.fsbno == NULLFSBLOCK) { > > args.fsbno = be64_to_cpu(start->l); > > - args.type = XFS_ALLOCTYPE_START_BNO; > > /* > > * Make sure there is sufficient room left in the AG to > > * complete a full tree split for an extent insert. If > > @@ -534,10 +535,6 @@ xfs_bmbt_alloc_block( > > * block allocation here and corrupt the filesystem. > > */ > > args.minleft = xfs_trans_get_block_res(args.tp); > > - } else if (cur->bc_private.b.flist->xbf_low) { > > - args.type = XFS_ALLOCTYPE_START_BNO; > > - } else { > > - args.type = XFS_ALLOCTYPE_NEAR_BNO; > > } > > > > args.minlen = args.maxlen = args.prod = 1; > > I think that's exactly what I had in mind when we looked at this last > time but now I'm not so sure. I think I understand now why there is a > requirement that all subsequent allocations must (read 'can') come from > the same AG. > > The first time we try to allocate blocks we look for an AG with all the > space we need for the reservation to succeed. Of course, that explains the near bno allocation them. It's basically saying "we only ever have a reservation for a freespace btree split in a single AG". Ok, so it seems we need to retain that, and the problem of not being able to allocate in the current AG is elsewhere. Thanks for pointing that out to me, Lachlan - I should have realised that is what it was for in the first place. > If we don't find a single > AG that will satisfy the full request then we use the low space algorithm > and start from AG 0. But if we do find such an AG with all the space we > need it should be fine to use NEAR_BNO on subsequent allocations because > the space is there, right? But for the case we are dealing with here - bmbt insertion after an allocation - we already have firstblock set and hence never traverse the args.fsbno == NULLFSBLOCK case as the cursor firstblock is always set to the block that has already been allocated. Hence we never hit the low space algorithm unless it was triggered by the extent allocation. > If this logic fails to hold true then there are other problems we need to > deal with including; > > If we've already allocated a block from AG N then all subsequent > allocations must come from AGs >= N to prevent locking issues. If the > blocks we need are in AGs < N we're in trouble. Agreed - this is what minleft is trying to avoid from happening ahead of the first allocations. > We need to fall back to > the low space algorithm and start from AG 0 but we can't do this if we > have already allocated from AG N. Right, even the low space algorithm cannot avoid the need to avoid lock inversions in the AGs. > Also if we pass START_BNO to xfs_alloc_vextent() then it will fall > through to the ANY/START/FIRST_AG case and set flags to TRYLOCK. It That's not a problem. > will then scan AGs from N to the last AG and skip any that are already > locked. It may skip AGs that have the space we need and we may get to > the last AG without finding all required blocks. Sure, but.... > Then xfs_alloc_vextent() will loop through the AGs a second time if we > didn't get all we needed the first time but this time without the TRYLOCK. > This could result in locking AGs out of order. It loops the second time from the start AG, not AG zero so we should not be locking AGs out of order. > So why is the extent allocation using all of the remaining blocks in the > AG and not leaving any for the split? That's the big question, isn't it? I'll come back to this. > Are we not reserving the split > space with the conversion transaction? I don't think this is relevant - the problem is allocation, not extent conversion. > Is it because the space needed > for splits is not accounted for when delayed allocations are reserved so > we oversubscribe ourselves? This is preallocation into a hole, so delalloc is not involved here. > Or are we not activating the low space > algorithm early enough? There's plenty of free space, just not in the AG _after_ the extent allocation, and so there's no reason to trigger the low space algorithm. > I need to look at the code a bit more to understand all this. The same > allocation logic appears to be used in xfs_bmap_extents_to_btree(), > xfs_bmap_local_to_extents() and xfs_bmap_btalloc() so maybe there's some > logic in the logic. I think they are all fine, now that I think I understand the reason for the logic - that we only reserve enough blocks in the transaction for freespace tree splits in a single AG. New blocks in the bmbt can be triggered when: 1. extent conversion splits a record into two (or three) records. 2. extent removal split a record into two (punch a hole in the middle of an extent) 3. extent allocation inserts a new record 4. converting from extent to btree format Note that the local to extent change is actually a data extent allocation (same as xfs_bmap_btalloc), not a bmbt allocation. So, in the cases of #1 and #2, it is unlikely that there has been a previous allocation in the transaction, and so we take the first case of args.fsbno == NULLFSBLOCK to chose an AG with enough space for an entire bmbt split. Otherwise, it's the same case as #3 and #4. In the case of #3 and #4, the reason for the bmbt allocation is that we have allocated a new extent, and as a result firstblock _will_ be set. Hence we should have either space in the AG available for the bmbt split _or_ the low space algorithm is active. This problem case is that for #3/4 we can enter xfs_bmbt_alloc_block with the initial condition of firstblock set, not enough space in the current AG and the low space algorithm not active. For the test case (250), we still have lots of free space in the filesystem so we should not be activating the low space algorithm. That leaves the fact we are not leaving enough space in the AG for the bmbt split after the extent is allocated. AFAICT, this code in xfs_bmapi() handles it: 4452 if (wr && *firstblock == NULLFSBLOCK) { 4453 if (XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE) 4454 minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1; 4455 else 4456 minleft = 1; 4457 } else 4458 minleft = 0; So it seems to me that in all cases the setting of minleft is correct, as are the fallbacks that clear minleft then set xbf_low in xfs_bmap_btalloc(). So I don't think the bug actually resides at the xfs-bmapi/bmbt level. The question I'm now trying to answer is how we can allocate the entire AG if minleft is actually set. The answer, it appears, is in xfs_alloc_fix_minleft(). If we've allocated the largest extent possible, we'll have a situation where the only free space remaining is on the freelist. To use numbers relevant to the test case: agsize = 4096 blocks (16MB) max ext len = 4096 - 8 blocks = 4088 blocks AGF freeblks = 4088 AGF flcount = 4 minleft = 2; extent allocation length = 4088 diff = 4088 + 4 - 4088 - 2 = 2 And because diff >= 0, xfs_alloc_fix_minleft() says that the extent length is OK and does not trim it. Then when we go to allocate bmbt blocks, args->total is not set so the first block is allocated from the free list, and the second then fails at the xfs_alloc_fix_freelist checks because we don't have space available in the AG.... IOWs, we've set minleft appropriately, but xfs_alloc_fix_minleft() assumes that the AGFL blocks are part of the space available for the allocation. This, it seems, is wrong. If minleft is set, then we are not allowing the allocation to dip into any reserves, so we should not be considering the freelist blocks as available for minleft reservations. So, the patch below fixes the test 250 assert failure as well, and to me seems much more likely as the root cause of the bug. What do you think, Lachlan? Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx --- fs/xfs/xfs_alloc.c | 1 - fs/xfs/xfs_bmap_btree.c | 1 + 2 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/xfs/xfs_alloc.c b/fs/xfs/xfs_alloc.c index 27d64d7..8946464 100644 --- a/fs/xfs/xfs_alloc.c +++ b/fs/xfs/xfs_alloc.c @@ -280,7 +280,6 @@ xfs_alloc_fix_minleft( return 1; agf = XFS_BUF_TO_AGF(args->agbp); diff = be32_to_cpu(agf->agf_freeblks) - + be32_to_cpu(agf->agf_flcount) - args->len - args->minleft; if (diff >= 0) return 1; diff --git a/fs/xfs/xfs_bmap_btree.c b/fs/xfs/xfs_bmap_btree.c index 87d3c10..470047e 100644 --- a/fs/xfs/xfs_bmap_btree.c +++ b/fs/xfs/xfs_bmap_btree.c @@ -518,6 +518,7 @@ xfs_bmbt_alloc_block( args.mp = cur->bc_mp; args.fsbno = cur->bc_private.b.firstblock; args.firstblock = args.fsbno; + args.total = 1; if (args.fsbno == NULLFSBLOCK) { args.fsbno = be64_to_cpu(start->l); _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs