On Wed, Mar 04, 2020 at 01:21:44PM -0500, Brian Foster wrote: > On Tue, Mar 03, 2020 at 07:28:41PM -0800, Darrick J. Wong wrote: > > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > > > > Add a new btree function that enables us to bulk load a btree cursor. > > This will be used by the upcoming online repair patches to generate new > > btrees. This avoids the programmatic inefficiency of calling > > xfs_btree_insert in a loop (which generates a lot of log traffic) in > > favor of stamping out new btree blocks with ordered buffers, and then > > committing both the new root and scheduling the removal of the old btree > > blocks in a single transaction commit. > > > > The design of this new generic code is based off the btree rebuilding > > code in xfs_repair's phase 5 code, with the explicit goal of enabling us > > to share that code between scrub and repair. It has the additional > > feature of being able to control btree block loading factors. > > > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > > --- > > fs/xfs/libxfs/xfs_btree.c | 581 +++++++++++++++++++++++++++++++++++++++++++++ > > fs/xfs/libxfs/xfs_btree.h | 46 ++++ > > fs/xfs/xfs_trace.c | 1 > > fs/xfs/xfs_trace.h | 85 +++++++ > > 4 files changed, 712 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c > > index 469e1e9053bb..c21db7ed8481 100644 > > --- a/fs/xfs/libxfs/xfs_btree.c > > +++ b/fs/xfs/libxfs/xfs_btree.c > > @@ -1324,7 +1324,7 @@ STATIC void > > xfs_btree_copy_ptrs( > > struct xfs_btree_cur *cur, > > union xfs_btree_ptr *dst_ptr, > > - union xfs_btree_ptr *src_ptr, > > + const union xfs_btree_ptr *src_ptr, > > int numptrs) > > { > > ASSERT(numptrs >= 0); > > @@ -5099,3 +5099,582 @@ xfs_btree_commit_ifakeroot( > > cur->bc_ops = ops; > > cur->bc_flags &= ~XFS_BTREE_STAGING; > > } > > + > > +/* > > + * Bulk Loading of Staged Btrees > > + * ============================= > > + * > > + * This interface is used with a staged btree cursor to create a totally new > > + * btree with a large number of records (i.e. more than what would fit in a > > + * single block). When the creation is complete, the new root can be linked > > + * atomically into the filesystem by committing the staged cursor. > > + * > > + * The first step for the caller is to construct a fake btree root structure > > + * and a staged btree cursor. A staging cursor contains all the geometry > > + * information for the btree type but will fail all operations that could have > > + * side effects in the filesystem (e.g. btree shape changes). Regular > > + * operations will not work unless the staging cursor is committed and becomes > > + * a regular cursor. > > + * > > + * For a btree rooted in an AG header, use an xbtree_afakeroot structure. > > + * This should be initialized to zero. For a btree rooted in an inode fork, > > + * use an xbtree_ifakeroot structure. @if_fork_size field should be set to > > + * the number of bytes available to the fork in the inode; @if_fork should > > + * point to a freshly allocated xfs_inode_fork; and @if_format should be set > > + * to the appropriate fork type (e.g. XFS_DINODE_FMT_BTREE). > > + * > > + * The next step for the caller is to initialize a struct xfs_btree_bload > > + * context. The @nr_records field is the number of records that are to be > > + * loaded into the btree. The @leaf_slack and @node_slack fields are the > > + * number of records (or key/ptr) slots to leave empty in new btree blocks. > > + * If a caller sets a slack value to -1, the slack value will be computed to > > + * fill the block halfway between minrecs and maxrecs items per block. > > + * > > + * The number of items placed in each btree block is computed via the following > > + * algorithm: For leaf levels, the number of items for the level is nr_records. > > + * For node levels, the number of items for the level is the number of blocks > > + * in the next lower level of the tree. For each level, the desired number of > > + * items per block is defined as: > > + * > > + * desired = max(minrecs, maxrecs - slack factor) > > + * > > + * The number of blocks for the level is defined to be: > > + * > > + * blocks = nr_items / desired > > + * > > + * Note this is rounded down so that the npb calculation below will never fall > > + * below minrecs. The number of items that will actually be loaded into each > > + * btree block is defined as: > > + * > > + * npb = nr_items / blocks > > + * > > + * Some of the leftmost blocks in the level will contain one extra record as > > + * needed to handle uneven division. If the number of records in any block > > + * would exceed maxrecs for that level, blocks is incremented and npb is > > + * recalculated. > > + * > > + * In other words, we compute the number of blocks needed to satisfy a given > > + * loading level, then spread the items as evenly as possible. > > + * > > + * To complete this step, call xfs_btree_bload_compute_geometry, which uses > > + * those settings to compute the height of the btree and the number of blocks > > + * that will be needed to construct the btree. These values are stored in the > > + * @btree_height and @nr_blocks fields. > > + * > > + * At this point, the caller must allocate @nr_blocks blocks and save them for > > + * later. If space is to be allocated transactionally, the staging cursor > > + * must be deleted before and recreated after, which is why computing the > > + * geometry is a separate step. > > + * Honestly, this whole block comment probably ought to be reorganized to present the six steps to bulk btree reconstruction and then have subsections to cover the tricky details of computing the geometry. Let me go work on that a bit. Here's a possible revision: /* * Bulk Loading of Staged Btrees * ============================= * * This interface is used with a staged btree cursor to create a totally new * btree with a large number of records (i.e. more than what would fit in a * single root block). When the creation is complete, the new root can be * linked atomically into the filesystem by committing the staged cursor. * * Creation of a new btree proceeds roughly as follows: * * The first step is to initialize an appropriate fake btree root structure and * then construct a staged btree cursor. Refer to the block comments about * "Bulk Loading for AG Btrees" and "Bulk Loading for Inode-Rooted Btrees" for * more information about how to do this. * * The second step is to initialize a struct xfs_btree_bload context as * follows: * * - nr_records is the number of records that are to be loaded into the btree. * * - leaf_slack is the number of records to leave empty in new leaf blocks. * * - node_slack is the number of key/ptr slots to leave empty in new node * blocks. * * If a caller sets a slack value to -1, that slack value will be computed to * fill the block halfway between minrecs and maxrecs items per block. * * - get_data is a function will be called for each record that will be loaded * into the btree. It must set the cursor's bc_rec field. Records returned * from this function /must/ be in sort order for the btree type, as they * are converted to on-disk format and written to disk in order! * * - alloc_block is a function that should return a pointer to one of the * blocks that are pre-allocated in step four. * * - For btrees which are rooted in an inode fork, iroot_size is a function * that will be called to compute the size of the incore btree root block. * * All other fields should be zero. * * The third step is to call xfs_btree_bload_compute_geometry to compute the * height of and the number of blocks needed to construct the btree. These * values are stored in the @btree_height and @nr_blocks fields of struct * xfs_btree_bload. See the section "Computing the Geometry of the New Btree" * for details about this computation. * * In step four, the caller must allocate xfs_btree_bload.nr_blocks blocks and * save them for later calls to alloc_block(). Bulk loading requires all * blocks to be allocated beforehand to avoid ENOSPC failures midway through a * rebuild, and to minimize seek distances of the new btree. * * If disk space is to be allocated transactionally, the staging cursor must be * deleted before allocation and recreated after. * * Step five is to call xfs_btree_bload() to start constructing the btree. * * The final step is to commit the staging cursor, which logs the new btree * root, turns the btree cursor into a regular btree cursor. The caller is * responsible for cleaning up the previous btree, if any. * * Computing the Geometry of the New Btree * ======================================= * * The number of items placed in each btree block is computed via the following * algorithm: For leaf levels, the number of items for the level is nr_records * in the bload structure. For node levels, the number of items for the level * is the number of blocks in the next lower level of the tree. For each * level, the desired number of items per block is defined as: * * desired = max(minrecs, maxrecs - slack factor) * * The number of blocks for the level is defined to be: * * blocks = floor(nr_items / desired) * * Note this is rounded down so that the npb calculation below will never fall * below minrecs. The number of items that will actually be loaded into each * btree block is defined as: * * npb = nr_items / blocks * * Some of the leftmost blocks in the level will contain one extra record as * needed to handle uneven division. If the number of records in any block * would exceed maxrecs for that level, blocks is incremented and npb is * recalculated. * * In other words, we compute the number of blocks needed to satisfy a given * loading level, then spread the items as evenly as possible. * * The height and number of fs blocks required to create the btree are computed * and returned via btree_height and nr_blocks. */ > I'm not following this ordering requirement wrt to the staging cursor..? I /think/ the reason I put that in there is because rolling the transaction in between space allocations can change sc->tp and there's no way to update the btree cursor to point to the new transaction. *However* on second thought I can't see why we would need or even want a transaction to be attached to the staging cursor during the rebuild process. Staging cursors can't do normal btree updates, and there's no need for a transaction since the new blocks are attached to a delwri list. So I think we can even rearrange the code here so that the _stage_cursor functions don't take a transaction at all, and only set bc_tp when we commit the new btree. > > + * The fourth step in the bulk loading process is to set the > > function pointers > > + * in the bload context structure. @get_data will be called for each record > > + * that will be loaded into the btree; it should set the cursor's bc_rec > > + * field, which will be converted to on-disk format and copied into the > > + * appropriate record slot. @alloc_block should supply one of the blocks > > + * allocated in the previous step. For btrees which are rooted in an inode > > + * fork, @iroot_size is called to compute the size of the incore btree root > > + * block. Call xfs_btree_bload to start constructing the btree. > > + * > > + * The final step is to commit the staging cursor, which logs the new btree > > + * root and turns the btree into a regular btree cursor, and free the fake > > + * roots. > > + */ > > + > > +/* > > + * Put a btree block that we're loading onto the ordered list and release it. > > + * The btree blocks will be written when the final transaction swapping the > > + * btree roots is committed. > > + */ > > +static void > > +xfs_btree_bload_drop_buf( > > + struct xfs_btree_bload *bbl, > > + struct xfs_trans *tp, > > + struct xfs_buf **bpp) > > +{ > > + if (*bpp == NULL) > > + return; > > + > > + xfs_buf_delwri_queue(*bpp, &bbl->buffers_list); > > + xfs_trans_brelse(tp, *bpp); > > + *bpp = NULL; > > +} > > + > > +/* Allocate and initialize one btree block for bulk loading. */ > > +STATIC int > > +xfs_btree_bload_prep_block( > > + struct xfs_btree_cur *cur, > > + struct xfs_btree_bload *bbl, > > + unsigned int level, > > + unsigned int nr_this_block, > > + union xfs_btree_ptr *ptrp, > > + struct xfs_buf **bpp, > > + struct xfs_btree_block **blockp, > > + void *priv) > > +{ > > Would help to have some one-line comments to describe the params. It > looks like some of these are the previous pointers, but are also > input/output..? Ok. "The new btree block will have its level and numrecs fields set to the values of the level and nr_this_block parameters, respectively. If bpp is set on entry, the buffer will be released. On exit, ptrp, bpp, and blockp will all point to the new block." > > + union xfs_btree_ptr new_ptr; > > + struct xfs_buf *new_bp; > > + struct xfs_btree_block *new_block; > > + int ret; > > + > > + if ((cur->bc_flags & XFS_BTREE_ROOT_IN_INODE) && > > + level == cur->bc_nlevels - 1) { > > + struct xfs_ifork *ifp = cur->bc_private.b.ifake->if_fork; > > Wasn't a helper added for this cur -> ifp access? Yes. I'll go use that instead. > > + size_t new_size; > > + > > + /* Allocate a new incore btree root block. */ > > + new_size = bbl->iroot_size(cur, nr_this_block, priv); > > + ifp->if_broot = kmem_zalloc(new_size, 0); > > + ifp->if_broot_bytes = (int)new_size; > > + ifp->if_flags |= XFS_IFBROOT; > > + > > + /* Initialize it and send it out. */ > > + xfs_btree_init_block_int(cur->bc_mp, ifp->if_broot, > > + XFS_BUF_DADDR_NULL, cur->bc_btnum, level, > > + nr_this_block, cur->bc_private.b.ip->i_ino, > > + cur->bc_flags); > > + > > + *bpp = NULL; > > Is there no old bpp to drop here? Correct. We drop the buffer between levels, which means that when we prep the inode root, *bpp should already be NULL. However, I guess it won't hurt to xfs_btree_bload_drop_buf here just in case that ever changes. > > + *blockp = ifp->if_broot; > > + xfs_btree_set_ptr_null(cur, ptrp); > > + return 0; > > + } > > + > > + /* Allocate a new leaf block. */ > > + ret = bbl->alloc_block(cur, &new_ptr, priv); > > + if (ret) > > + return ret; > > + > > + ASSERT(!xfs_btree_ptr_is_null(cur, &new_ptr)); > > + > > + ret = xfs_btree_get_buf_block(cur, &new_ptr, &new_block, &new_bp); > > + if (ret) > > + return ret; > > + > > + /* Initialize the btree block. */ > > + xfs_btree_init_block_cur(cur, new_bp, level, nr_this_block); > > + if (*blockp) > > + xfs_btree_set_sibling(cur, *blockp, &new_ptr, XFS_BB_RIGHTSIB); > > + xfs_btree_set_sibling(cur, new_block, ptrp, XFS_BB_LEFTSIB); > > + xfs_btree_set_numrecs(new_block, nr_this_block); > > I think numrecs is already set by the init_block_cur() call above. Yes. Fixed. > > + > > + /* Release the old block and set the out parameters. */ > > + xfs_btree_bload_drop_buf(bbl, cur->bc_tp, bpp); > > + *blockp = new_block; > > + *bpp = new_bp; > > + xfs_btree_copy_ptrs(cur, ptrp, &new_ptr, 1); > > + return 0; > > +} > > + > > +/* Load one leaf block. */ > > +STATIC int > > +xfs_btree_bload_leaf( > > + struct xfs_btree_cur *cur, > > + unsigned int recs_this_block, > > + xfs_btree_bload_get_fn get_data, > > + struct xfs_btree_block *block, > > + void *priv) > > +{ > > + unsigned int j; > > + int ret; > > + > > + /* Fill the leaf block with records. */ > > + for (j = 1; j <= recs_this_block; j++) { > > + union xfs_btree_rec *block_recs; > > + > > s/block_recs/block_rec/ ? Fixed. > > + ret = get_data(cur, priv); > > + if (ret) > > + return ret; > > + block_recs = xfs_btree_rec_addr(cur, j, block); > > + cur->bc_ops->init_rec_from_cur(cur, block_recs); > > + } > > + > > + return 0; > > +} > > + > > +/* Load one node block. */ > > More comments here to document the child_ptr please.. "child_ptr must point to a block within the next level down in the tree. A key/ptr entry will be created in the new node block to the block pointed to by child_ptr. On exit, child_ptr will be advanced to where it needs to be to start the next _bload_node call." > > +STATIC int > > +xfs_btree_bload_node( > > + struct xfs_btree_cur *cur, > > + unsigned int recs_this_block, > > + union xfs_btree_ptr *child_ptr, > > + struct xfs_btree_block *block) > > +{ > > + unsigned int j; > > + int ret; > > + > > + /* Fill the node block with keys and pointers. */ > > + for (j = 1; j <= recs_this_block; j++) { > > + union xfs_btree_key child_key; > > + union xfs_btree_ptr *block_ptr; > > + union xfs_btree_key *block_key; > > + struct xfs_btree_block *child_block; > > + struct xfs_buf *child_bp; > > + > > + ASSERT(!xfs_btree_ptr_is_null(cur, child_ptr)); > > + > > + ret = xfs_btree_get_buf_block(cur, child_ptr, &child_block, > > + &child_bp); > > + if (ret) > > + return ret; > > + > > + xfs_btree_get_keys(cur, child_block, &child_key); > > Any reason this isn't pushed down a couple lines with the key copy code? No reason. > > + > > + block_ptr = xfs_btree_ptr_addr(cur, j, block); > > + xfs_btree_copy_ptrs(cur, block_ptr, child_ptr, 1); > > + > > + block_key = xfs_btree_key_addr(cur, j, block); > > + xfs_btree_copy_keys(cur, block_key, &child_key, 1); > > + > > + xfs_btree_get_sibling(cur, child_block, child_ptr, > > + XFS_BB_RIGHTSIB); > > + xfs_trans_brelse(cur->bc_tp, child_bp); > > + } > > + > > + return 0; > > +} > > + > > +/* > > + * Compute the maximum number of records (or keyptrs) per block that we want to > > + * install at this level in the btree. Caller is responsible for having set > > + * @cur->bc_private.b.forksize to the desired fork size, if appropriate. > > + */ > > +STATIC unsigned int > > +xfs_btree_bload_max_npb( > > + struct xfs_btree_cur *cur, > > + struct xfs_btree_bload *bbl, > > + unsigned int level) > > +{ > > + unsigned int ret; > > + > > + if (level == cur->bc_nlevels - 1 && cur->bc_ops->get_dmaxrecs) > > + return cur->bc_ops->get_dmaxrecs(cur, level); > > + > > + ret = cur->bc_ops->get_maxrecs(cur, level); > > + if (level == 0) > > + ret -= bbl->leaf_slack; > > + else > > + ret -= bbl->node_slack; > > + return ret; > > +} > > + > > +/* > > + * Compute the desired number of records (or keyptrs) per block that we want to > > + * install at this level in the btree, which must be somewhere between minrecs > > + * and max_npb. The caller is free to install fewer records per block. > > + */ > > +STATIC unsigned int > > +xfs_btree_bload_desired_npb( > > + struct xfs_btree_cur *cur, > > + struct xfs_btree_bload *bbl, > > + unsigned int level) > > +{ > > + unsigned int npb = xfs_btree_bload_max_npb(cur, bbl, level); > > + > > + /* Root blocks are not subject to minrecs rules. */ > > + if (level == cur->bc_nlevels - 1) > > + return max(1U, npb); > > + > > + return max_t(unsigned int, cur->bc_ops->get_minrecs(cur, level), npb); > > +} > > + > > +/* > > + * Compute the number of records to be stored in each block at this level and > > + * the number of blocks for this level. For leaf levels, we must populate an > > + * empty root block even if there are no records, so we have to have at least > > + * one block. > > + */ > > +STATIC void > > +xfs_btree_bload_level_geometry( > > + struct xfs_btree_cur *cur, > > + struct xfs_btree_bload *bbl, > > + unsigned int level, > > + uint64_t nr_this_level, > > + unsigned int *avg_per_block, > > + uint64_t *blocks, > > + uint64_t *blocks_with_extra) > > +{ > > + uint64_t npb; > > + uint64_t dontcare; > > + unsigned int desired_npb; > > + unsigned int maxnr; > > + > > + maxnr = cur->bc_ops->get_maxrecs(cur, level); > > + > > + /* > > + * Compute the number of blocks we need to fill each block with the > > + * desired number of records/keyptrs per block. Because desired_npb > > + * could be minrecs, we use regular integer division (which rounds > > + * the block count down) so that in the next step the effective # of > > + * items per block will never be less than desired_npb. > > + */ > > + desired_npb = xfs_btree_bload_desired_npb(cur, bbl, level); > > + *blocks = div64_u64_rem(nr_this_level, desired_npb, &dontcare); > > + *blocks = max(1ULL, *blocks); > > + > > + /* > > + * Compute the number of records that we will actually put in each > > + * block, assuming that we want to spread the records evenly between > > + * the blocks. Take care that the effective # of items per block (npb) > > + * won't exceed maxrecs even for the blocks that get an extra record, > > + * since desired_npb could be maxrecs, and in the previous step we > > + * rounded the block count down. > > + */ > > + npb = div64_u64_rem(nr_this_level, *blocks, blocks_with_extra); > > + if (npb > maxnr || (npb == maxnr && *blocks_with_extra > 0)) { > > + (*blocks)++; > > + npb = div64_u64_rem(nr_this_level, *blocks, blocks_with_extra); > > + } > > + > > + *avg_per_block = min_t(uint64_t, npb, nr_this_level); > > + > > + trace_xfs_btree_bload_level_geometry(cur, level, nr_this_level, > > + *avg_per_block, desired_npb, *blocks, > > + *blocks_with_extra); > > +} > > + > > +/* > > + * Ensure a slack value is appropriate for the btree. > > + * > > + * If the slack value is negative, set slack so that we fill the block to > > + * halfway between minrecs and maxrecs. Make sure the slack is never so large > > + * that we can underflow minrecs. > > + */ > > +static void > > +xfs_btree_bload_ensure_slack( > > + struct xfs_btree_cur *cur, > > + int *slack, > > + int level) > > +{ > > + int maxr; > > + int minr; > > + > > + /* > > + * We only care about slack for btree blocks, so set the btree nlevels > > + * to 3 so that level 0 is a leaf block and level 1 is a node block. > > + * Avoid straying into inode roots, since we don't do slack there. > > + */ > > + cur->bc_nlevels = 3; > > Ok, but what does this assignment do as it relates to the code? It seems > this is related to this function as it is overwritten by the caller... Hm, I'm not 100% sure what you're confused about -- what does "as it relates to the code" mean? In any case, we're creating an artificial btree geometry here so that we can measure min and maxrecs for a given level, and setting slack based on that. "3" is the magic value so that we always get min/max recs for a level that consists of fs blocks (as opposed to inode roots). We don't have to preserve the old value since we're about to compute the real one. Hmm, maybe you're wondering why we're setting nlevels = 3 here instead of in the caller? That might be a good idea... > > + maxr = cur->bc_ops->get_maxrecs(cur, level); > > + minr = cur->bc_ops->get_minrecs(cur, level); > > + > > + /* > > + * If slack is negative, automatically set slack so that we load the > > + * btree block approximately halfway between minrecs and maxrecs. > > + * Generally, this will net us 75% loading. > > + */ > > + if (*slack < 0) > > + *slack = maxr - ((maxr + minr) >> 1); > > + > > + *slack = min(*slack, maxr - minr); > > +} > > + > > +/* > > + * Prepare a btree cursor for a bulk load operation by computing the geometry > > + * fields in @bbl. Caller must ensure that the btree cursor is a staging > > + * cursor. This function can be called multiple times. > > + */ > > +int > > +xfs_btree_bload_compute_geometry( > > + struct xfs_btree_cur *cur, > > + struct xfs_btree_bload *bbl, > > + uint64_t nr_records) > > +{ > > + uint64_t nr_blocks = 0; > > + uint64_t nr_this_level; > > + > > + ASSERT(cur->bc_flags & XFS_BTREE_STAGING); > > + ...so then this becomes: /* * Make sure that the slack values make sense for btree blocks * that are full disk blocks by setting the btree nlevels to 3. * We don't try to enforce slack for inode roots. */ cur->bc_nlevels = 3; xfs_btree_bload_ensure_slack(cur, &bbl->leaf_slack, 0); xfs_btree_bload_ensure_slack(cur, &bbl->node_slack, 1); > > + xfs_btree_bload_ensure_slack(cur, &bbl->leaf_slack, 0); > > + xfs_btree_bload_ensure_slack(cur, &bbl->node_slack, 1); > > + > > + bbl->nr_records = nr_this_level = nr_records; > > I found nr_this_level a bit vague of a name when reading through the > code below. Perhaps level_recs is a bit more clear..? > > > + for (cur->bc_nlevels = 1; cur->bc_nlevels < XFS_BTREE_MAXLEVELS;) { > > + uint64_t level_blocks; > > + uint64_t dontcare64; > > + unsigned int level = cur->bc_nlevels - 1; > > + unsigned int avg_per_block; > > + > > + /* > > + * If all the things we want to store at this level would fit > > + * in a single root block, then we have our btree root and are > > + * done. Note that bmap btrees do not allow records in the > > + * root. > > + */ > > + if (!(cur->bc_flags & XFS_BTREE_ROOT_IN_INODE) || level != 0) { > > + xfs_btree_bload_level_geometry(cur, bbl, level, > > + nr_this_level, &avg_per_block, > > + &level_blocks, &dontcare64); > > + if (nr_this_level <= avg_per_block) { > > + nr_blocks++; > > + break; > > + } > > + } > > + > > + /* > > + * Otherwise, we have to store all the records for this level > > + * in blocks and therefore need another level of btree to point > > + * to those blocks. Increase the number of levels and > > + * recompute the number of records we can store at this level > > + * because that can change depending on whether or not a level > > + * is the root level. > > + */ > > + cur->bc_nlevels++; > > Hmm.. so does the ->bc_nlevels increment affect the > _bload_level_geometry() call or is it just part of the loop iteration? > If the latter, can these two _bload_level_geometry() calls be combined? It affects the xfs_btree_bload_level_geometry call because that calls ->get_maxrecs(), which returns a different answer for the root level when the root is an inode fork. Therefore, we cannot combine the calls. > > > + xfs_btree_bload_level_geometry(cur, bbl, level, nr_this_level, > > + &avg_per_block, &level_blocks, &dontcare64); > > + nr_blocks += level_blocks; > > + nr_this_level = level_blocks; > > + } > > + > > + if (cur->bc_nlevels == XFS_BTREE_MAXLEVELS) > > + return -EOVERFLOW; > > + > > + bbl->btree_height = cur->bc_nlevels; > > + if (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE) > > + bbl->nr_blocks = nr_blocks - 1; > > + else > > + bbl->nr_blocks = nr_blocks; > > + return 0; > > +} > > + > > +/* > > + * Bulk load a btree. > > + * > > + * Load @bbl->nr_records quantity of records into a btree using the supplied > > + * empty and staging btree cursor @cur and a @bbl that has been filled out by > > + * the xfs_btree_bload_compute_geometry function. > > + * > > + * The @bbl->get_data function must populate the cursor's bc_rec every time it > > + * is called. The @bbl->alloc_block function will be used to allocate new > > + * btree blocks. @priv is passed to both functions. > > + * > > + * Caller must ensure that @cur is a staging cursor. Any existing btree rooted > > + * in the fakeroot will be lost, so do not call this function twice. > > + */ > > +int > > +xfs_btree_bload( > > + struct xfs_btree_cur *cur, > > + struct xfs_btree_bload *bbl, > > + void *priv) > > +{ > > + union xfs_btree_ptr child_ptr; > > + union xfs_btree_ptr ptr; > > + struct xfs_buf *bp = NULL; > > + struct xfs_btree_block *block = NULL; > > + uint64_t nr_this_level = bbl->nr_records; > > + uint64_t blocks; > > + uint64_t i; > > + uint64_t blocks_with_extra; > > + uint64_t total_blocks = 0; > > + unsigned int avg_per_block; > > + unsigned int level = 0; > > + int ret; > > + > > + ASSERT(cur->bc_flags & XFS_BTREE_STAGING); > > + > > + INIT_LIST_HEAD(&bbl->buffers_list); > > + cur->bc_nlevels = bbl->btree_height; > > + xfs_btree_set_ptr_null(cur, &child_ptr); > > + xfs_btree_set_ptr_null(cur, &ptr); > > + > > + xfs_btree_bload_level_geometry(cur, bbl, level, nr_this_level, > > + &avg_per_block, &blocks, &blocks_with_extra); > > + > > + /* Load each leaf block. */ > > + for (i = 0; i < blocks; i++) { > > + unsigned int nr_this_block = avg_per_block; > > + > > + if (i < blocks_with_extra) > > + nr_this_block++; > > + > > + ret = xfs_btree_bload_prep_block(cur, bbl, level, > > + nr_this_block, &ptr, &bp, &block, priv); > > + if (ret) > > + return ret; > > + > > + trace_xfs_btree_bload_block(cur, level, i, blocks, &ptr, > > + nr_this_block); > > + > > + ret = xfs_btree_bload_leaf(cur, nr_this_block, bbl->get_data, > > + block, priv); > > + if (ret) > > + goto out; > > + > > + /* Record the leftmost pointer to start the next level. */ > > + if (i == 0) > > + xfs_btree_copy_ptrs(cur, &child_ptr, &ptr, 1); > > "leftmost pointer" refers to the leftmost leaf block..? Yes. "Record the leftmost leaf pointer so we know where to start with the first node level." ? > > + } > > + total_blocks += blocks; > > + xfs_btree_bload_drop_buf(bbl, cur->bc_tp, &bp); > > + > > + /* Populate the internal btree nodes. */ > > + for (level = 1; level < cur->bc_nlevels; level++) { > > + union xfs_btree_ptr first_ptr; > > + > > + nr_this_level = blocks; > > + block = NULL; > > + xfs_btree_set_ptr_null(cur, &ptr); > > + > > + xfs_btree_bload_level_geometry(cur, bbl, level, nr_this_level, > > + &avg_per_block, &blocks, &blocks_with_extra); > > + > > + /* Load each node block. */ > > + for (i = 0; i < blocks; i++) { > > + unsigned int nr_this_block = avg_per_block; > > + > > + if (i < blocks_with_extra) > > + nr_this_block++; > > + > > + ret = xfs_btree_bload_prep_block(cur, bbl, level, > > + nr_this_block, &ptr, &bp, &block, > > + priv); > > + if (ret) > > + return ret; > > + > > + trace_xfs_btree_bload_block(cur, level, i, blocks, > > + &ptr, nr_this_block); > > + > > + ret = xfs_btree_bload_node(cur, nr_this_block, > > + &child_ptr, block); > > + if (ret) > > + goto out; > > + > > + /* > > + * Record the leftmost pointer to start the next level. > > + */ > > And the same thing here. I think the generic ptr name is a little > confusing, though I don't have a better suggestion. I think it would > help if the comments were more explicit to say something like: "ptr > refers to the current block addr. Save the first block in the current > level so the next level up knows where to start looking for keys." Yes, I'll do that: "Record the leftmost node pointer so that we know where to start the next node level above this one." Thanks for reviewing! --D > Brian > > > + if (i == 0) > > + xfs_btree_copy_ptrs(cur, &first_ptr, &ptr, 1); > > + } > > + total_blocks += blocks; > > + xfs_btree_bload_drop_buf(bbl, cur->bc_tp, &bp); > > + xfs_btree_copy_ptrs(cur, &child_ptr, &first_ptr, 1); > > + } > > + > > + /* Initialize the new root. */ > > + if (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE) { > > + ASSERT(xfs_btree_ptr_is_null(cur, &ptr)); > > + cur->bc_private.b.ifake->if_levels = cur->bc_nlevels; > > + cur->bc_private.b.ifake->if_blocks = total_blocks - 1; > > + } else { > > + cur->bc_private.a.afake->af_root = be32_to_cpu(ptr.s); > > + cur->bc_private.a.afake->af_levels = cur->bc_nlevels; > > + cur->bc_private.a.afake->af_blocks = total_blocks; > > + } > > + > > + /* > > + * Write the new blocks to disk. If the ordered list isn't empty after > > + * that, then something went wrong and we have to fail. This should > > + * never happen, but we'll check anyway. > > + */ > > + ret = xfs_buf_delwri_submit(&bbl->buffers_list); > > + if (ret) > > + goto out; > > + if (!list_empty(&bbl->buffers_list)) { > > + ASSERT(list_empty(&bbl->buffers_list)); > > + ret = -EIO; > > + } > > +out: > > + xfs_buf_delwri_cancel(&bbl->buffers_list); > > + if (bp) > > + xfs_trans_brelse(cur->bc_tp, bp); > > + return ret; > > +} > > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h > > index 2965ed663418..51720de366ae 100644 > > --- a/fs/xfs/libxfs/xfs_btree.h > > +++ b/fs/xfs/libxfs/xfs_btree.h > > @@ -578,4 +578,50 @@ void xfs_btree_stage_ifakeroot(struct xfs_btree_cur *cur, > > void xfs_btree_commit_ifakeroot(struct xfs_btree_cur *cur, int whichfork, > > const struct xfs_btree_ops *ops); > > > > +typedef int (*xfs_btree_bload_get_fn)(struct xfs_btree_cur *cur, void *priv); > > +typedef int (*xfs_btree_bload_alloc_block_fn)(struct xfs_btree_cur *cur, > > + union xfs_btree_ptr *ptr, void *priv); > > +typedef size_t (*xfs_btree_bload_iroot_size_fn)(struct xfs_btree_cur *cur, > > + unsigned int nr_this_level, void *priv); > > + > > +/* Bulk loading of staged btrees. */ > > +struct xfs_btree_bload { > > + /* Buffer list for delwri_queue. */ > > + struct list_head buffers_list; > > + > > + /* Function to store a record in the cursor. */ > > + xfs_btree_bload_get_fn get_data; > > + > > + /* Function to allocate a block for the btree. */ > > + xfs_btree_bload_alloc_block_fn alloc_block; > > + > > + /* Function to compute the size of the in-core btree root block. */ > > + xfs_btree_bload_iroot_size_fn iroot_size; > > + > > + /* Number of records the caller wants to store. */ > > + uint64_t nr_records; > > + > > + /* Number of btree blocks needed to store those records. */ > > + uint64_t nr_blocks; > > + > > + /* > > + * Number of free records to leave in each leaf block. If this (or > > + * any of the slack values) are negative, this will be computed to > > + * be halfway between maxrecs and minrecs. This typically leaves the > > + * block 75% full. > > + */ > > + int leaf_slack; > > + > > + /* Number of free keyptrs to leave in each node block. */ > > + int node_slack; > > + > > + /* Computed btree height. */ > > + unsigned int btree_height; > > +}; > > + > > +int xfs_btree_bload_compute_geometry(struct xfs_btree_cur *cur, > > + struct xfs_btree_bload *bbl, uint64_t nr_records); > > +int xfs_btree_bload(struct xfs_btree_cur *cur, struct xfs_btree_bload *bbl, > > + void *priv); > > + > > #endif /* __XFS_BTREE_H__ */ > > diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c > > index bc85b89f88ca..9b5e58a92381 100644 > > --- a/fs/xfs/xfs_trace.c > > +++ b/fs/xfs/xfs_trace.c > > @@ -6,6 +6,7 @@ > > #include "xfs.h" > > #include "xfs_fs.h" > > #include "xfs_shared.h" > > +#include "xfs_bit.h" > > #include "xfs_format.h" > > #include "xfs_log_format.h" > > #include "xfs_trans_resv.h" > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > > index 7e162ca80c92..69e8605f9f97 100644 > > --- a/fs/xfs/xfs_trace.h > > +++ b/fs/xfs/xfs_trace.h > > @@ -35,6 +35,7 @@ struct xfs_icreate_log; > > struct xfs_owner_info; > > struct xfs_trans_res; > > struct xfs_inobt_rec_incore; > > +union xfs_btree_ptr; > > > > DECLARE_EVENT_CLASS(xfs_attr_list_class, > > TP_PROTO(struct xfs_attr_list_context *ctx), > > @@ -3655,6 +3656,90 @@ TRACE_EVENT(xfs_btree_commit_ifakeroot, > > __entry->blocks) > > ) > > > > +TRACE_EVENT(xfs_btree_bload_level_geometry, > > + TP_PROTO(struct xfs_btree_cur *cur, unsigned int level, > > + uint64_t nr_this_level, unsigned int nr_per_block, > > + unsigned int desired_npb, uint64_t blocks, > > + uint64_t blocks_with_extra), > > + TP_ARGS(cur, level, nr_this_level, nr_per_block, desired_npb, blocks, > > + blocks_with_extra), > > + TP_STRUCT__entry( > > + __field(dev_t, dev) > > + __field(xfs_btnum_t, btnum) > > + __field(unsigned int, level) > > + __field(unsigned int, nlevels) > > + __field(uint64_t, nr_this_level) > > + __field(unsigned int, nr_per_block) > > + __field(unsigned int, desired_npb) > > + __field(unsigned long long, blocks) > > + __field(unsigned long long, blocks_with_extra) > > + ), > > + TP_fast_assign( > > + __entry->dev = cur->bc_mp->m_super->s_dev; > > + __entry->btnum = cur->bc_btnum; > > + __entry->level = level; > > + __entry->nlevels = cur->bc_nlevels; > > + __entry->nr_this_level = nr_this_level; > > + __entry->nr_per_block = nr_per_block; > > + __entry->desired_npb = desired_npb; > > + __entry->blocks = blocks; > > + __entry->blocks_with_extra = blocks_with_extra; > > + ), > > + TP_printk("dev %d:%d btree %s level %u/%u nr_this_level %llu nr_per_block %u desired_npb %u blocks %llu blocks_with_extra %llu", > > + MAJOR(__entry->dev), MINOR(__entry->dev), > > + __print_symbolic(__entry->btnum, XFS_BTNUM_STRINGS), > > + __entry->level, > > + __entry->nlevels, > > + __entry->nr_this_level, > > + __entry->nr_per_block, > > + __entry->desired_npb, > > + __entry->blocks, > > + __entry->blocks_with_extra) > > +) > > + > > +TRACE_EVENT(xfs_btree_bload_block, > > + TP_PROTO(struct xfs_btree_cur *cur, unsigned int level, > > + uint64_t block_idx, uint64_t nr_blocks, > > + union xfs_btree_ptr *ptr, unsigned int nr_records), > > + TP_ARGS(cur, level, block_idx, nr_blocks, ptr, nr_records), > > + TP_STRUCT__entry( > > + __field(dev_t, dev) > > + __field(xfs_btnum_t, btnum) > > + __field(unsigned int, level) > > + __field(unsigned long long, block_idx) > > + __field(unsigned long long, nr_blocks) > > + __field(xfs_agnumber_t, agno) > > + __field(xfs_agblock_t, agbno) > > + __field(unsigned int, nr_records) > > + ), > > + TP_fast_assign( > > + __entry->dev = cur->bc_mp->m_super->s_dev; > > + __entry->btnum = cur->bc_btnum; > > + __entry->level = level; > > + __entry->block_idx = block_idx; > > + __entry->nr_blocks = nr_blocks; > > + if (cur->bc_flags & XFS_BTREE_LONG_PTRS) { > > + xfs_fsblock_t fsb = be64_to_cpu(ptr->l); > > + > > + __entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsb); > > + __entry->agbno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsb); > > + } else { > > + __entry->agno = cur->bc_private.a.agno; > > + __entry->agbno = be32_to_cpu(ptr->s); > > + } > > + __entry->nr_records = nr_records; > > + ), > > + TP_printk("dev %d:%d btree %s level %u block %llu/%llu fsb (%u/%u) recs %u", > > + MAJOR(__entry->dev), MINOR(__entry->dev), > > + __print_symbolic(__entry->btnum, XFS_BTNUM_STRINGS), > > + __entry->level, > > + __entry->block_idx, > > + __entry->nr_blocks, > > + __entry->agno, > > + __entry->agbno, > > + __entry->nr_records) > > +) > > + > > #endif /* _TRACE_XFS_H */ > > > > #undef TRACE_INCLUDE_PATH > > >