On Wed, Jun 22, 2016 at 11:17:06AM -0400, Brian Foster wrote: > On Thu, Jun 16, 2016 at 06:19:15PM -0700, Darrick J. Wong wrote: > > On a filesystem with both reflink and reverse mapping enabled, it's > > possible to have multiple rmap records referring to the same blocks on > > disk. When overlapping intervals are possible, querying a classic > > btree to find all records intersecting a given interval is inefficient > > because we cannot use the left side of the search interval to filter > > out non-matching records the same way that we can use the existing > > btree key to filter out records coming after the right side of the > > search interval. This will become important once we want to use the > > rmap btree to rebuild BMBTs, or implement the (future) fsmap ioctl. > > > > (For the non-overlapping case, we can perform such queries trivially > > by starting at the left side of the interval and walking the tree > > until we pass the right side.) > > > > Therefore, extend the btree code to come closer to supporting > > intervals as a first-class record attribute. This involves widening > > the btree node's key space to store both the lowest key reachable via > > the node pointer (as the btree does now) and the highest key reachable > > via the same pointer and teaching the btree modifying functions to > > keep the highest-key records up to date. > > > > This behavior can be turned on via a new btree ops flag so that btrees > > that cannot store overlapping intervals don't pay the overhead costs > > in terms of extra code and disk format changes. > > > > v2: When we're deleting a record in a btree that supports overlapped > > interval records and the deletion results in two btree blocks being > > joined, we defer updating the high/low keys until after all possible > > joining (at higher levels in the tree) have finished. At this point, > > the btree pointers at all levels have been updated to remove the empty > > blocks and we can update the low and high keys. > > > > When we're doing this, we must be careful to update the keys of all > > node pointers up to the root instead of stopping at the first set of > > keys that don't need updating. This is because it's possible for a > > single deletion to cause joining of multiple levels of tree, and so > > we need to update everything going back to the root. > > > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > > --- > > I think I get the gist of this and it mostly looks Ok to me. A few > questions and minor comments... Ok. > > fs/xfs/libxfs/xfs_btree.c | 379 +++++++++++++++++++++++++++++++++++++++++---- > > fs/xfs/libxfs/xfs_btree.h | 16 ++ > > fs/xfs/xfs_trace.h | 36 ++++ > > 3 files changed, 395 insertions(+), 36 deletions(-) > > > > > > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c > > index a096539..afcafd6 100644 > > --- a/fs/xfs/libxfs/xfs_btree.c > > +++ b/fs/xfs/libxfs/xfs_btree.c > > @@ -52,6 +52,11 @@ static const __uint32_t xfs_magics[2][XFS_BTNUM_MAX] = { > > xfs_magics[!!((cur)->bc_flags & XFS_BTREE_CRC_BLOCKS)][cur->bc_btnum] > > > > > > +struct xfs_btree_double_key { > > + union xfs_btree_key low; > > + union xfs_btree_key high; > > +}; > > + > > STATIC int /* error (0 or EFSCORRUPTED) */ > > xfs_btree_check_lblock( > > struct xfs_btree_cur *cur, /* btree cursor */ > > @@ -428,6 +433,30 @@ xfs_btree_dup_cursor( > > * into a btree block (xfs_btree_*_offset) or return a pointer to the given > > * record, key or pointer (xfs_btree_*_addr). Note that all addressing > > * inside the btree block is done using indices starting at one, not zero! > > + * > > + * If XFS_BTREE_OVERLAPPING is set, then this btree supports keys containing > > + * overlapping intervals. In such a tree, records are still sorted lowest to > > + * highest and indexed by the smallest key value that refers to the record. > > + * However, nodes are different: each pointer has two associated keys -- one > > + * indexing the lowest key available in the block(s) below (the same behavior > > + * as the key in a regular btree) and another indexing the highest key > > + * available in the block(s) below. Because records are /not/ sorted by the > > + * highest key, all leaf block updates require us to compute the highest key > > + * that matches any record in the leaf and to recursively update the high keys > > + * in the nodes going further up in the tree, if necessary. Nodes look like > > + * this: > > + * > > + * +--------+-----+-----+-----+-----+-----+-------+-------+-----+ > > + * Non-Leaf: | header | lo1 | hi1 | lo2 | hi2 | ... | ptr 1 | ptr 2 | ... | > > + * +--------+-----+-----+-----+-----+-----+-------+-------+-----+ > > + * > > + * To perform an interval query on an overlapped tree, perform the usual > > + * depth-first search and use the low and high keys to decide if we can skip > > + * that particular node. If a leaf node is reached, return the records that > > + * intersect the interval. Note that an interval query may return numerous > > + * entries. For a non-overlapped tree, simply search for the record associated > > + * with the lowest key and iterate forward until a non-matching record is > > + * found. > > */ > > > > /* > > @@ -445,6 +474,17 @@ static inline size_t xfs_btree_block_len(struct xfs_btree_cur *cur) > > return XFS_BTREE_SBLOCK_LEN; > > } > > > > +/* Return size of btree block keys for this btree instance. */ > > +static inline size_t xfs_btree_key_len(struct xfs_btree_cur *cur) > > +{ > > + size_t len; > > + > > + len = cur->bc_ops->key_len; > > + if (cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING) > > + len *= 2; > > + return len; > > +} > > + > > /* > > * Return size of btree block pointers for this btree instance. > > */ > > @@ -475,7 +515,19 @@ xfs_btree_key_offset( > > int n) > > { > > return xfs_btree_block_len(cur) + > > - (n - 1) * cur->bc_ops->key_len; > > + (n - 1) * xfs_btree_key_len(cur); > > +} > > + > > +/* > > + * Calculate offset of the n-th high key in a btree block. > > + */ > > +STATIC size_t > > +xfs_btree_high_key_offset( > > + struct xfs_btree_cur *cur, > > + int n) > > +{ > > + return xfs_btree_block_len(cur) + > > + (n - 1) * xfs_btree_key_len(cur) + cur->bc_ops->key_len; > > } > > > > /* > > @@ -488,7 +540,7 @@ xfs_btree_ptr_offset( > > int level) > > { > > return xfs_btree_block_len(cur) + > > - cur->bc_ops->get_maxrecs(cur, level) * cur->bc_ops->key_len + > > + cur->bc_ops->get_maxrecs(cur, level) * xfs_btree_key_len(cur) + > > (n - 1) * xfs_btree_ptr_len(cur); > > } > > > > @@ -519,6 +571,19 @@ xfs_btree_key_addr( > > } > > > > /* > > + * Return a pointer to the n-th high key in the btree block. > > + */ > > +STATIC union xfs_btree_key * > > +xfs_btree_high_key_addr( > > + struct xfs_btree_cur *cur, > > + int n, > > + struct xfs_btree_block *block) > > +{ > > + return (union xfs_btree_key *) > > + ((char *)block + xfs_btree_high_key_offset(cur, n)); > > +} > > + > > +/* > > * Return a pointer to the n-th block pointer in the btree block. > > */ > > STATIC union xfs_btree_ptr * > > @@ -1217,7 +1282,7 @@ xfs_btree_copy_keys( > > int numkeys) > > { > > ASSERT(numkeys >= 0); > > - memcpy(dst_key, src_key, numkeys * cur->bc_ops->key_len); > > + memcpy(dst_key, src_key, numkeys * xfs_btree_key_len(cur)); > > } > > > > /* > > @@ -1263,8 +1328,8 @@ xfs_btree_shift_keys( > > ASSERT(numkeys >= 0); > > ASSERT(dir == 1 || dir == -1); > > > > - dst_key = (char *)key + (dir * cur->bc_ops->key_len); > > - memmove(dst_key, key, numkeys * cur->bc_ops->key_len); > > + dst_key = (char *)key + (dir * xfs_btree_key_len(cur)); > > + memmove(dst_key, key, numkeys * xfs_btree_key_len(cur)); > > } > > > > /* > > @@ -1879,6 +1944,180 @@ error0: > > return error; > > } > > > > +/* Determine the low and high keys of a leaf block */ > > +STATIC void > > +xfs_btree_find_leaf_keys( > > + struct xfs_btree_cur *cur, > > + struct xfs_btree_block *block, > > + union xfs_btree_key *low, > > + union xfs_btree_key *high) > > +{ > > + int n; > > + union xfs_btree_rec *rec; > > + union xfs_btree_key max_hkey; > > + union xfs_btree_key hkey; > > + > > + rec = xfs_btree_rec_addr(cur, 1, block); > > + cur->bc_ops->init_key_from_rec(low, rec); > > + > > + if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING)) > > + return; > > + > > + cur->bc_ops->init_high_key_from_rec(&max_hkey, rec); > > + for (n = 2; n <= xfs_btree_get_numrecs(block); n++) { > > + rec = xfs_btree_rec_addr(cur, n, block); > > + cur->bc_ops->init_high_key_from_rec(&hkey, rec); > > + if (cur->bc_ops->diff_two_keys(cur, &max_hkey, &hkey) > 0) > > + max_hkey = hkey; > > + } > > + > > + *high = max_hkey; > > +} > > + > > +/* Determine the low and high keys of a node block */ > > +STATIC void > > +xfs_btree_find_node_keys( > > + struct xfs_btree_cur *cur, > > + struct xfs_btree_block *block, > > + union xfs_btree_key *low, > > + union xfs_btree_key *high) > > +{ > > + int n; > > + union xfs_btree_key *hkey; > > + union xfs_btree_key *max_hkey; > > + > > + *low = *xfs_btree_key_addr(cur, 1, block); > > + > > + if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING)) > > + return; > > + > > + max_hkey = xfs_btree_high_key_addr(cur, 1, block); > > + for (n = 2; n <= xfs_btree_get_numrecs(block); n++) { > > + hkey = xfs_btree_high_key_addr(cur, n, block); > > + if (cur->bc_ops->diff_two_keys(cur, max_hkey, hkey) > 0) > > + max_hkey = hkey; > > + } > > + > > + *high = *max_hkey; > > +} > > + > > +/* > > + * Update parental low & high keys from some block all the way back to the > > + * root of the btree. > > + */ > > +STATIC int > > +__xfs_btree_updkeys( > > + struct xfs_btree_cur *cur, > > + int level, > > + struct xfs_btree_block *block, > > + struct xfs_buf *bp0, > > + bool force_all) > > +{ > > + union xfs_btree_key lkey; /* keys from current level */ > > + union xfs_btree_key hkey; > > + union xfs_btree_key *nlkey; /* keys from the next level up */ > > + union xfs_btree_key *nhkey; > > + struct xfs_buf *bp; > > + int ptr = -1; > > ptr doesn't appear to require initialization. Ok. > > > + > > + if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING)) > > + return 0; > > + > > + if (level + 1 >= cur->bc_nlevels) > > + return 0; > > This could use a comment to indicate we're checking for a parent level > to update. Ok. > > > + > > + trace_xfs_btree_updkeys(cur, level, bp0); > > + > > + if (level == 0) > > + xfs_btree_find_leaf_keys(cur, block, &lkey, &hkey); > > + else > > + xfs_btree_find_node_keys(cur, block, &lkey, &hkey); > > + for (level++; level < cur->bc_nlevels; level++) { > > + block = xfs_btree_get_block(cur, level, &bp); > > + trace_xfs_btree_updkeys(cur, level, bp); > > + ptr = cur->bc_ptrs[level]; > > + nlkey = xfs_btree_key_addr(cur, ptr, block); > > + nhkey = xfs_btree_high_key_addr(cur, ptr, block); > > + if (!(cur->bc_ops->diff_two_keys(cur, nlkey, &lkey) != 0 || > > + cur->bc_ops->diff_two_keys(cur, nhkey, &hkey) != 0) && > > + !force_all) > > + break; > > + memcpy(nlkey, &lkey, cur->bc_ops->key_len); > > + memcpy(nhkey, &hkey, cur->bc_ops->key_len); > > + xfs_btree_log_keys(cur, bp, ptr, ptr); > > + if (level + 1 >= cur->bc_nlevels) > > + break; > > + xfs_btree_find_node_keys(cur, block, &lkey, &hkey); > > + } > > + > > + return 0; > > +} > > + > > +/* > > + * Update all the keys from a sibling block at some level in the cursor back > > + * to the root, stopping when we find a key pair that doesn't need updating. > > + */ > > +STATIC int > > +xfs_btree_sibling_updkeys( > > + struct xfs_btree_cur *cur, > > + int level, > > + int ptr, > > + struct xfs_btree_block *block, > > + struct xfs_buf *bp0) > > +{ > > + struct xfs_btree_cur *ncur; > > + int stat; > > + int error; > > + > > + error = xfs_btree_dup_cursor(cur, &ncur); > > + if (error) > > + return error; > > + > > + if (level + 1 >= ncur->bc_nlevels) > > + error = -EDOM; > > + else if (ptr == XFS_BB_RIGHTSIB) > > + error = xfs_btree_increment(ncur, level + 1, &stat); > > + else if (ptr == XFS_BB_LEFTSIB) > > + error = xfs_btree_decrement(ncur, level + 1, &stat); > > + else > > + error = -EBADE; > > So we inc/dec the cursor at the next level up the tree, then update the > keys up that path with the __xfs_btree_updkeys() call below. The inc/dec > calls explicitly say that they don't alter the cursor below the level, > so it looks like we'd end up with a weird cursor path here. > > Digging around further, it looks like we pass the sibling bp/block > pointers from the caller and thus __xfs_btree_updkeys() should do the > correct thing, but this is not very clear. If I'm on the right track, > I'd suggest to add a big fat comment here. :) Yep. /* * The caller passed us the sibling block in bp0/block, but the * (duplicate) cursor points to original block and not the sibling. * Therefore we must adjust the cursor at the next level higher * to point to the sibling block we were handed. Only then can * we go up the tree updating keys. */ > > + if (error || !stat) > > + return error; > > Looks like a potential cursor leak on error. Oops! > > + > > + error = __xfs_btree_updkeys(ncur, level, block, bp0, false); > > + xfs_btree_del_cursor(ncur, XFS_BTREE_NOERROR); > > + return error; > > +} > > + > > +/* > > + * Update all the keys from some level in cursor back to the root, stopping > > + * when we find a key pair that don't need updating. > > + */ > > +STATIC int > > +xfs_btree_updkeys( > > + struct xfs_btree_cur *cur, > > + int level) > > +{ > > + struct xfs_buf *bp; > > + struct xfs_btree_block *block; > > + > > + block = xfs_btree_get_block(cur, level, &bp); > > + return __xfs_btree_updkeys(cur, level, block, bp, false); > > +} > > + > > +/* Update all the keys from some level in cursor back to the root. */ > > +STATIC int > > +xfs_btree_updkeys_force( > > + struct xfs_btree_cur *cur, > > + int level) > > +{ > > + struct xfs_buf *bp; > > + struct xfs_btree_block *block; > > + > > + block = xfs_btree_get_block(cur, level, &bp); > > + return __xfs_btree_updkeys(cur, level, block, bp, true); > > +} > > + > > /* > > * Update keys at all levels from here to the root along the cursor's path. > > */ > > @@ -1893,6 +2132,9 @@ xfs_btree_updkey( > > union xfs_btree_key *kp; > > int ptr; > > > > + if (cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING) > > + return 0; > > + > > XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY); > > XFS_BTREE_TRACE_ARGIK(cur, level, keyp); > > > > @@ -1970,7 +2212,8 @@ xfs_btree_update( > > ptr, LASTREC_UPDATE); > > } > > > > - /* Updating first rec in leaf. Pass new key value up to our parent. */ > > + /* Pass new key value up to our parent. */ > > + xfs_btree_updkeys(cur, 0); > > if (ptr == 1) { > > union xfs_btree_key key; > > > > @@ -2149,7 +2392,9 @@ xfs_btree_lshift( > > rkp = &key; > > } > > > > - /* Update the parent key values of right. */ > > + /* Update the parent key values of left and right. */ > > + xfs_btree_sibling_updkeys(cur, level, XFS_BB_LEFTSIB, left, lbp); > > + xfs_btree_updkeys(cur, level); > > error = xfs_btree_updkey(cur, rkp, level + 1); > > if (error) > > goto error0; > > @@ -2321,6 +2566,9 @@ xfs_btree_rshift( > > if (error) > > goto error1; > > > > + /* Update left and right parent pointers */ > > + xfs_btree_updkeys(cur, level); > > + xfs_btree_updkeys(tcur, level); > > In this case, we grab the last record of the block, increment from there > and update using the cursor. This is much more straightforward, imo. > Could we use this approach in the left shift case as well? Yes, I think so. I might have started refactoring btree_sibling_updkeys out of existence and got distracted, since there isn't anything that uses the RIGHTSIB ptr value. > > error = xfs_btree_updkey(tcur, rkp, level + 1); > > if (error) > > goto error1; > > @@ -2356,7 +2604,7 @@ __xfs_btree_split( > > struct xfs_btree_cur *cur, > > int level, > > union xfs_btree_ptr *ptrp, > > - union xfs_btree_key *key, > > + struct xfs_btree_double_key *key, > > struct xfs_btree_cur **curp, > > int *stat) /* success/failure */ > > { > > @@ -2452,9 +2700,6 @@ __xfs_btree_split( > > > > xfs_btree_log_keys(cur, rbp, 1, rrecs); > > xfs_btree_log_ptrs(cur, rbp, 1, rrecs); > > - > > - /* Grab the keys to the entries moved to the right block */ > > - xfs_btree_copy_keys(cur, key, rkp, 1); > > } else { > > /* It's a leaf. Move records. */ > > union xfs_btree_rec *lrp; /* left record pointer */ > > @@ -2465,12 +2710,8 @@ __xfs_btree_split( > > > > xfs_btree_copy_recs(cur, rrp, lrp, rrecs); > > xfs_btree_log_recs(cur, rbp, 1, rrecs); > > - > > - cur->bc_ops->init_key_from_rec(key, > > - xfs_btree_rec_addr(cur, 1, right)); > > } > > > > - > > /* > > * Find the left block number by looking in the buffer. > > * Adjust numrecs, sibling pointers. > > @@ -2484,6 +2725,12 @@ __xfs_btree_split( > > xfs_btree_set_numrecs(left, lrecs); > > xfs_btree_set_numrecs(right, xfs_btree_get_numrecs(right) + rrecs); > > > > + /* Find the low & high keys for the new block. */ > > + if (level > 0) > > + xfs_btree_find_node_keys(cur, right, &key->low, &key->high); > > + else > > + xfs_btree_find_leaf_keys(cur, right, &key->low, &key->high); > > + > > Why not push these into the above if/else where the previous key > copy/init calls were removed from? We don't set bb_numrecs on the right block until the line above the new hunk, and the btree_find_*_keys functions require numrecs to be set. The removed key copy/init calls only looked at keys[1]. That said, it's trivial to move the set_numrecs calls above the if statement. > > xfs_btree_log_block(cur, rbp, XFS_BB_ALL_BITS); > > xfs_btree_log_block(cur, lbp, XFS_BB_NUMRECS | XFS_BB_RIGHTSIB); > > > > @@ -2499,6 +2746,10 @@ __xfs_btree_split( > > xfs_btree_set_sibling(cur, rrblock, &rptr, XFS_BB_LEFTSIB); > > xfs_btree_log_block(cur, rrbp, XFS_BB_LEFTSIB); > > } > > + > > + /* Update the left block's keys... */ > > + xfs_btree_updkeys(cur, level); > > + > > /* > > * If the cursor is really in the right block, move it there. > > * If it's just pointing past the last entry in left, then we'll > > @@ -2537,7 +2788,7 @@ struct xfs_btree_split_args { > > struct xfs_btree_cur *cur; > > int level; > > union xfs_btree_ptr *ptrp; > > - union xfs_btree_key *key; > > + struct xfs_btree_double_key *key; > > struct xfs_btree_cur **curp; > > int *stat; /* success/failure */ > > int result; > > @@ -2586,7 +2837,7 @@ xfs_btree_split( > > struct xfs_btree_cur *cur, > > int level, > > union xfs_btree_ptr *ptrp, > > - union xfs_btree_key *key, > > + struct xfs_btree_double_key *key, > > struct xfs_btree_cur **curp, > > int *stat) /* success/failure */ > > { > > @@ -2806,27 +3057,27 @@ xfs_btree_new_root( > > bp = lbp; > > nptr = 2; > > } > > + > > /* Fill in the new block's btree header and log it. */ > > xfs_btree_init_block_cur(cur, nbp, cur->bc_nlevels, 2); > > xfs_btree_log_block(cur, nbp, XFS_BB_ALL_BITS); > > ASSERT(!xfs_btree_ptr_is_null(cur, &lptr) && > > !xfs_btree_ptr_is_null(cur, &rptr)); > > - > > ? Don't know why I did that. I like having one blank line before a chunk of code, but there's no reason to remove that one. > > /* Fill in the key data in the new root. */ > > if (xfs_btree_get_level(left) > 0) { > > - xfs_btree_copy_keys(cur, > > + xfs_btree_find_node_keys(cur, left, > > xfs_btree_key_addr(cur, 1, new), > > - xfs_btree_key_addr(cur, 1, left), 1); > > - xfs_btree_copy_keys(cur, > > + xfs_btree_high_key_addr(cur, 1, new)); > > + xfs_btree_find_node_keys(cur, right, > > xfs_btree_key_addr(cur, 2, new), > > - xfs_btree_key_addr(cur, 1, right), 1); > > + xfs_btree_high_key_addr(cur, 2, new)); > > } else { > > - cur->bc_ops->init_key_from_rec( > > - xfs_btree_key_addr(cur, 1, new), > > - xfs_btree_rec_addr(cur, 1, left)); > > - cur->bc_ops->init_key_from_rec( > > - xfs_btree_key_addr(cur, 2, new), > > - xfs_btree_rec_addr(cur, 1, right)); > > + xfs_btree_find_leaf_keys(cur, left, > > + xfs_btree_key_addr(cur, 1, new), > > + xfs_btree_high_key_addr(cur, 1, new)); > > + xfs_btree_find_leaf_keys(cur, right, > > + xfs_btree_key_addr(cur, 2, new), > > + xfs_btree_high_key_addr(cur, 2, new)); > > } > > xfs_btree_log_keys(cur, nbp, 1, 2); > > > > @@ -2837,6 +3088,7 @@ xfs_btree_new_root( > > xfs_btree_ptr_addr(cur, 2, new), &rptr, 1); > > xfs_btree_log_ptrs(cur, nbp, 1, 2); > > > > + > > Extra line. Removed. > > /* Fix up the cursor. */ > > xfs_btree_setbuf(cur, cur->bc_nlevels, nbp); > > cur->bc_ptrs[cur->bc_nlevels] = nptr; > > @@ -2862,7 +3114,7 @@ xfs_btree_make_block_unfull( > > int *index, /* new tree index */ > > union xfs_btree_ptr *nptr, /* new btree ptr */ > > struct xfs_btree_cur **ncur, /* new btree cursor */ > > - union xfs_btree_key *key, /* key of new block */ > > + struct xfs_btree_double_key *key, /* key of new block */ > > int *stat) > > { > > int error = 0; > > @@ -2918,6 +3170,22 @@ xfs_btree_make_block_unfull( > > return 0; > > } > > > > +/* Copy a double key into a btree block. */ > > +static void > > +xfs_btree_copy_double_keys( > > + struct xfs_btree_cur *cur, > > + int ptr, > > + struct xfs_btree_block *block, > > + struct xfs_btree_double_key *key) > > +{ > > + memcpy(xfs_btree_key_addr(cur, ptr, block), &key->low, > > + cur->bc_ops->key_len); > > + > > + if (cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING) > > + memcpy(xfs_btree_high_key_addr(cur, ptr, block), &key->high, > > + cur->bc_ops->key_len); > > +} > > + > > /* > > * Insert one record/level. Return information to the caller > > * allowing the next level up to proceed if necessary. > > @@ -2927,7 +3195,7 @@ xfs_btree_insrec( > > struct xfs_btree_cur *cur, /* btree cursor */ > > int level, /* level to insert record at */ > > union xfs_btree_ptr *ptrp, /* i/o: block number inserted */ > > - union xfs_btree_key *key, /* i/o: block key for ptrp */ > > + struct xfs_btree_double_key *key, /* i/o: block key for ptrp */ > > struct xfs_btree_cur **curp, /* output: new cursor replacing cur */ > > int *stat) /* success/failure */ > > { > > @@ -2935,7 +3203,7 @@ xfs_btree_insrec( > > struct xfs_buf *bp; /* buffer for block */ > > union xfs_btree_ptr nptr; /* new block ptr */ > > struct xfs_btree_cur *ncur; /* new btree cursor */ > > - union xfs_btree_key nkey; /* new block key */ > > + struct xfs_btree_double_key nkey; /* new block key */ > > union xfs_btree_rec rec; /* record to insert */ > > int optr; /* old key/record index */ > > int ptr; /* key/record index */ > > @@ -2944,11 +3212,12 @@ xfs_btree_insrec( > > #ifdef DEBUG > > int i; > > #endif > > + xfs_daddr_t old_bn; > > > > /* Make a key out of the record data to be inserted, and save it. */ > > if (level == 0) { > > cur->bc_ops->init_rec_from_cur(cur, &rec); > > - cur->bc_ops->init_key_from_rec(key, &rec); > > + cur->bc_ops->init_key_from_rec(&key->low, &rec); > > } > > > > XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY); > > @@ -2983,6 +3252,7 @@ xfs_btree_insrec( > > > > /* Get pointers to the btree buffer and block. */ > > block = xfs_btree_get_block(cur, level, &bp); > > + old_bn = bp ? bp->b_bn : XFS_BUF_DADDR_NULL; > > numrecs = xfs_btree_get_numrecs(block); > > > > #ifdef DEBUG > > @@ -2996,7 +3266,7 @@ xfs_btree_insrec( > > ASSERT(cur->bc_ops->recs_inorder(cur, &rec, > > xfs_btree_rec_addr(cur, ptr, block))); > > } else { > > - ASSERT(cur->bc_ops->keys_inorder(cur, key, > > + ASSERT(cur->bc_ops->keys_inorder(cur, &key->low, > > xfs_btree_key_addr(cur, ptr, block))); > > } > > } > > @@ -3059,7 +3329,7 @@ xfs_btree_insrec( > > #endif > > > > /* Now put the new data in, bump numrecs and log it. */ > > - xfs_btree_copy_keys(cur, kp, key, 1); > > + xfs_btree_copy_double_keys(cur, ptr, block, key); > > xfs_btree_copy_ptrs(cur, pp, ptrp, 1); > > numrecs++; > > xfs_btree_set_numrecs(block, numrecs); > > @@ -3095,8 +3365,24 @@ xfs_btree_insrec( > > xfs_btree_log_block(cur, bp, XFS_BB_NUMRECS); > > > > /* If we inserted at the start of a block, update the parents' keys. */ > > This comment is associated with the codeblock that has been pushed > further down, no? Correct. I think that got mismerged somewhere along the way. > > + if (ncur && bp->b_bn != old_bn) { > > + /* > > + * We just inserted into a new tree block, which means that > > + * the key for the block is in nkey, not the tree. > > + */ > > + if (level == 0) > > + xfs_btree_find_leaf_keys(cur, block, &nkey.low, > > + &nkey.high); > > + else > > + xfs_btree_find_node_keys(cur, block, &nkey.low, > > + &nkey.high); > > + } else { > > + /* Updating the left block, do it the standard way. */ > > + xfs_btree_updkeys(cur, level); > > + } > > + > > Not quite sure I follow the purpose of this hunk. Is this for the case > where a btree split occurs, nkey is filled in for the new/right block > and then (after nkey is filled in) the new record ends up being added to > the new block? If so, what about the case where ncur is not created? > (It looks like that's possible from the code, but I could easily be > missing some context as to why that's not the case.) Yes, the first part of the if-else hunk is to fill out nkey when we've split a btree block. Now that I look at it again, I think that whole weird conditional could be replaced with the same xfs_btree_ptr_is_null() check later on. I think it can also be combined with it. Commentage for now: /* * If we just inserted a new tree block, we have to find the low * and high keys for the new block and arrange to pass them back * separately. If we're just updating a block we can use the * regular tree update mechanism. */ > In any event, I think we could elaborate a bit in the comment on why > this is necessary. I'd also move it above the top-level if/else. > > > if (optr == 1) { > > - error = xfs_btree_updkey(cur, key, level + 1); > > + error = xfs_btree_updkey(cur, &key->low, level + 1); > > if (error) > > goto error0; > > } > > @@ -3147,7 +3433,7 @@ xfs_btree_insert( > > union xfs_btree_ptr nptr; /* new block number (split result) */ > > struct xfs_btree_cur *ncur; /* new cursor (split result) */ > > struct xfs_btree_cur *pcur; /* previous level's cursor */ > > - union xfs_btree_key key; /* key of block to insert */ > > + struct xfs_btree_double_key key; /* key of block to insert */ > > Probably should fix up the function param alignment here and the couple > other or so places we make this change. I changed the name to xfs_btree_bigkey, which avoids the alignment problems. --D > > Brian > > > > > level = 0; > > ncur = NULL; > > @@ -3552,6 +3838,7 @@ xfs_btree_delrec( > > * If we deleted the leftmost entry in the block, update the > > * key values above us in the tree. > > */ > > + xfs_btree_updkeys(cur, level); > > if (ptr == 1) { > > error = xfs_btree_updkey(cur, keyp, level + 1); > > if (error) > > @@ -3882,6 +4169,16 @@ xfs_btree_delrec( > > if (level > 0) > > cur->bc_ptrs[level]--; > > > > + /* > > + * We combined blocks, so we have to update the parent keys if the > > + * btree supports overlapped intervals. However, bc_ptrs[level + 1] > > + * points to the old block so that the caller knows which record to > > + * delete. Therefore, the caller must be savvy enough to call updkeys > > + * for us if we return stat == 2. The other exit points from this > > + * function don't require deletions further up the tree, so they can > > + * call updkeys directly. > > + */ > > + > > XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT); > > /* Return value means the next level up has something to do. */ > > *stat = 2; > > @@ -3907,6 +4204,7 @@ xfs_btree_delete( > > int error; /* error return value */ > > int level; > > int i; > > + bool joined = false; > > > > XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY); > > > > @@ -3920,8 +4218,17 @@ xfs_btree_delete( > > error = xfs_btree_delrec(cur, level, &i); > > if (error) > > goto error0; > > + if (i == 2) > > + joined = true; > > } > > > > + /* > > + * If we combined blocks as part of deleting the record, delrec won't > > + * have updated the parent keys so we have to do that here. > > + */ > > + if (joined) > > + xfs_btree_updkeys_force(cur, 0); > > + > > if (i == 0) { > > for (level = 1; level < cur->bc_nlevels; level++) { > > if (cur->bc_ptrs[level] == 0) { > > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h > > index b99c018..a5ec6c7 100644 > > --- a/fs/xfs/libxfs/xfs_btree.h > > +++ b/fs/xfs/libxfs/xfs_btree.h > > @@ -126,6 +126,9 @@ struct xfs_btree_ops { > > size_t key_len; > > size_t rec_len; > > > > + /* flags */ > > + uint flags; > > + > > /* cursor operations */ > > struct xfs_btree_cur *(*dup_cursor)(struct xfs_btree_cur *); > > void (*update_cursor)(struct xfs_btree_cur *src, > > @@ -162,11 +165,21 @@ struct xfs_btree_ops { > > union xfs_btree_rec *rec); > > void (*init_ptr_from_cur)(struct xfs_btree_cur *cur, > > union xfs_btree_ptr *ptr); > > + void (*init_high_key_from_rec)(union xfs_btree_key *key, > > + union xfs_btree_rec *rec); > > > > /* difference between key value and cursor value */ > > __int64_t (*key_diff)(struct xfs_btree_cur *cur, > > union xfs_btree_key *key); > > > > + /* > > + * Difference between key2 and key1 -- positive if key2 > key1, > > + * negative if key2 < key1, and zero if equal. > > + */ > > + __int64_t (*diff_two_keys)(struct xfs_btree_cur *cur, > > + union xfs_btree_key *key1, > > + union xfs_btree_key *key2); > > + > > const struct xfs_buf_ops *buf_ops; > > > > #if defined(DEBUG) || defined(XFS_WARN) > > @@ -182,6 +195,9 @@ struct xfs_btree_ops { > > #endif > > }; > > > > +/* btree ops flags */ > > +#define XFS_BTREE_OPS_OVERLAPPING (1<<0) /* overlapping intervals */ > > + > > /* > > * Reasons for the update_lastrec method to be called. > > */ > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > > index 68f27f7..ffea28c 100644 > > --- a/fs/xfs/xfs_trace.h > > +++ b/fs/xfs/xfs_trace.h > > @@ -38,6 +38,7 @@ struct xlog_recover_item; > > struct xfs_buf_log_format; > > struct xfs_inode_log_format; > > struct xfs_bmbt_irec; > > +struct xfs_btree_cur; > > > > DECLARE_EVENT_CLASS(xfs_attr_list_class, > > TP_PROTO(struct xfs_attr_list_context *ctx), > > @@ -2183,6 +2184,41 @@ DEFINE_DISCARD_EVENT(xfs_discard_toosmall); > > DEFINE_DISCARD_EVENT(xfs_discard_exclude); > > DEFINE_DISCARD_EVENT(xfs_discard_busy); > > > > +/* btree cursor events */ > > +DECLARE_EVENT_CLASS(xfs_btree_cur_class, > > + TP_PROTO(struct xfs_btree_cur *cur, int level, struct xfs_buf *bp), > > + TP_ARGS(cur, level, bp), > > + TP_STRUCT__entry( > > + __field(dev_t, dev) > > + __field(xfs_btnum_t, btnum) > > + __field(int, level) > > + __field(int, nlevels) > > + __field(int, ptr) > > + __field(xfs_daddr_t, daddr) > > + ), > > + TP_fast_assign( > > + __entry->dev = cur->bc_mp->m_super->s_dev; > > + __entry->btnum = cur->bc_btnum; > > + __entry->level = level; > > + __entry->nlevels = cur->bc_nlevels; > > + __entry->ptr = cur->bc_ptrs[level]; > > + __entry->daddr = bp->b_bn; > > + ), > > + TP_printk("dev %d:%d btnum %d level %d/%d ptr %d daddr 0x%llx", > > + MAJOR(__entry->dev), MINOR(__entry->dev), > > + __entry->btnum, > > + __entry->level, > > + __entry->nlevels, > > + __entry->ptr, > > + (unsigned long long)__entry->daddr) > > +) > > + > > +#define DEFINE_BTREE_CUR_EVENT(name) \ > > +DEFINE_EVENT(xfs_btree_cur_class, name, \ > > + TP_PROTO(struct xfs_btree_cur *cur, int level, struct xfs_buf *bp), \ > > + TP_ARGS(cur, level, bp)) > > +DEFINE_BTREE_CUR_EVENT(xfs_btree_updkeys); > > + > > #endif /* _TRACE_XFS_H */ > > > > #undef TRACE_INCLUDE_PATH > > > > _______________________________________________ > > xfs mailing list > > xfs@xxxxxxxxxxx > > http://oss.sgi.com/mailman/listinfo/xfs -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html