On Wed, Oct 29, 2014 at 02:09:04PM +1100, Dave Chinner wrote: > From: Dave Chinner <dchinner@xxxxxxxxxx> > > In phase 5 we rebuild the freelist btrees, the AGF and the AGFL from > all the free space information we have gathered and resolved during > pahses 3 and 4. If the freespace information is laid out just right, > we end up having to allocate free space for the AGFL blocks. > > If the size of the free space we allocate from is larger than the > space we need, then we have to insert the remainder back into the > freespace btree. For the by-size tree, this means we are likely to > be removing a record from one leaf, and then inserting the remainder > - a smaller size - into another leaf. > > The issue is that the leaf blocks to the left of the original leaf > block we removed the extent record from are full and hence require a > split to insert the new record. That, of course, requires a free > block in the AGFL to allocate from, and now we have a chicken and > egg situation: there are no free blocks in the AGFL because we are > setting it up. > > As a result, setting up the free list silently fails, leaving the > freespace btrees in an inconsistent state and the AGFL in question > empty. When the filesystem is next mounted, the first allocation > from that AGF results in attempting to fix the AGFL, and it then > does exactly the same thing as the repair code, fails to allocate a > block during the split and fails. This results in an immediate > shutdown because the transaction doing the allocation is dirty by > this stage. > > The fix for the problem is to make repair handle rebulding the btree > differently. If we leave ispace for a couple of records in each > btree leaf and node, there is never a need for a split to occur when > initially setting up the AGFL. This results in repair doing the > right thing, and hence the runtime problems after mount don't occur. > Further, add error checking the the AGFL setup code and abort repair > if we have a failure to correctly set up the AGFL so we catch this > problem at repair time, not mount time... > Interesting problem, thanks for the breakdown. I suppose the only interesting side effect is that the alloc btrees might require a bit more space after repair than before. I wonder if we need to take that into consideration here for certain cases. E.g., does this have ramifications for running repair on a clean, but completely full fs, or should that be generally handled by the existence of reserved blocks? A couple minor nits below... > Reported-by: Barkley Vowk <bvowk@xxxxxxx> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > --- > repair/phase5.c | 44 +++++++++++++++++++++++++++++++++++--------- > 1 file changed, 35 insertions(+), 9 deletions(-) > > diff --git a/repair/phase5.c b/repair/phase5.c > index d6d3c6d..3d58936 100644 > --- a/repair/phase5.c > +++ b/repair/phase5.c > @@ -335,11 +335,22 @@ finish_cursor(bt_status_t *curs) > } > > /* > + * We need to leave some free records in the tree for the corner case of > + * setting up the AGFL. This may require allocation of blocks, and as > + * such can require insertion of new records into the tree (e.g. moving > + * a record in the by-count tree when a long extent is shortened). If we > + * pack the records into the leaves with no slack space, this requires a > + * leaf split to occur and a block to be allocated from the free list. > + * If we don't have any blocks on the free list (because we are setting > + * it up!), then we fail, and the filesystem will fail with the same > + * failure at runtime. Hence leave a couple of records slack space in > + * each block to allow immediate modification of the tree without > + * requiring splits to be done. > + * > * XXX(hch): any reason we don't just look at mp->m_alloc_mxr? I guess we could kill this comment now. ;) > */ > #define XR_ALLOC_BLOCK_MAXRECS(mp, level) \ > - xfs_allocbt_maxrecs((mp), (mp)->m_sb.sb_blocksize, \ > - (level) == 0) > + (xfs_allocbt_maxrecs((mp), (mp)->m_sb.sb_blocksize, (level) == 0) - 2) > > /* > * this calculates a freespace cursor for an ag. > @@ -361,10 +372,6 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, > bt_stat_level_t *p_lptr; > extent_tree_node_t *ext_ptr; > int level; > -#ifdef XR_BLD_FREE_TRACE > - fprintf(stderr, > - "in init_freespace_cursor, agno = %d\n", agno); > -#endif > > num_extents = *extents; > extents_used = 0; > @@ -385,6 +392,13 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, > lptr->num_recs_tot = num_extents; > level = 1; > > +#ifdef XR_BLD_FREE_TRACE > + fprintf(stderr, "%s 0 %d %d %d %d\n", __func__, What's the 0 for? Brian > + lptr->num_blocks, > + lptr->num_recs_pb, > + lptr->modulo, > + lptr->num_recs_tot); > +#endif > /* > * if we need more levels, set them up. # of records > * per level is the # of blocks in the level below it > @@ -402,6 +416,14 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, > lptr->num_recs_pb = p_lptr->num_blocks > / lptr->num_blocks; > lptr->num_recs_tot = p_lptr->num_blocks; > +#ifdef XR_BLD_FREE_TRACE > + fprintf(stderr, "%s %d %d %d %d %d\n", __func__, > + level, > + lptr->num_blocks, > + lptr->num_recs_pb, > + lptr->modulo, > + lptr->num_recs_tot); > +#endif > } > } > > @@ -546,8 +568,7 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, > lptr = &btree_curs->level[level]; > p_lptr = &btree_curs->level[level-1]; > lptr->num_blocks = howmany(p_lptr->num_blocks, > - XR_ALLOC_BLOCK_MAXRECS(mp, > - level)); > + XR_ALLOC_BLOCK_MAXRECS(mp, level)); > lptr->modulo = p_lptr->num_blocks > % lptr->num_blocks; > lptr->num_recs_pb = p_lptr->num_blocks > @@ -1434,6 +1455,7 @@ build_agf_agfl(xfs_mount_t *mp, > xfs_alloc_arg_t args; > xfs_trans_t *tp; > struct xfs_trans_res tres = {0}; > + int error; > > memset(&args, 0, sizeof(args)); > args.tp = tp = libxfs_trans_alloc(mp, 0); > @@ -1442,8 +1464,12 @@ build_agf_agfl(xfs_mount_t *mp, > args.alignment = 1; > args.pag = xfs_perag_get(mp,agno); > libxfs_trans_reserve(tp, &tres, XFS_MIN_FREELIST(agf, mp), 0); > - libxfs_alloc_fix_freelist(&args, 0); > + error = libxfs_alloc_fix_freelist(&args, 0); > xfs_perag_put(args.pag); > + if (error) { > + do_error(_("failed to fix AGFL on AG %d, error %d\n"), > + agno, error); > + } > libxfs_trans_commit(tp, 0); > } > > -- > 2.0.0 > > _______________________________________________ > xfs mailing list > xfs@xxxxxxxxxxx > http://oss.sgi.com/mailman/listinfo/xfs _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs