Re: [PATCH 013/119] xfs: support btrees with overlapping intervals for keys

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Jun 28, 2016 at 08:32:04AM -0400, Brian Foster wrote:
> On Mon, Jun 27, 2016 at 08:26:21PM -0700, Darrick J. Wong wrote:
> > On Wed, Jun 22, 2016 at 11:17:06AM -0400, Brian Foster wrote:
> > > On Thu, Jun 16, 2016 at 06:19:15PM -0700, Darrick J. Wong wrote:
> > > > On a filesystem with both reflink and reverse mapping enabled, it's
> > > > possible to have multiple rmap records referring to the same blocks on
> > > > disk.  When overlapping intervals are possible, querying a classic
> > > > btree to find all records intersecting a given interval is inefficient
> > > > because we cannot use the left side of the search interval to filter
> > > > out non-matching records the same way that we can use the existing
> > > > btree key to filter out records coming after the right side of the
> > > > search interval.  This will become important once we want to use the
> > > > rmap btree to rebuild BMBTs, or implement the (future) fsmap ioctl.
> > > > 
> > > > (For the non-overlapping case, we can perform such queries trivially
> > > > by starting at the left side of the interval and walking the tree
> > > > until we pass the right side.)
> > > > 
> > > > Therefore, extend the btree code to come closer to supporting
> > > > intervals as a first-class record attribute.  This involves widening
> > > > the btree node's key space to store both the lowest key reachable via
> > > > the node pointer (as the btree does now) and the highest key reachable
> > > > via the same pointer and teaching the btree modifying functions to
> > > > keep the highest-key records up to date.
> > > > 
> > > > This behavior can be turned on via a new btree ops flag so that btrees
> > > > that cannot store overlapping intervals don't pay the overhead costs
> > > > in terms of extra code and disk format changes.
> > > > 
> > > > v2: When we're deleting a record in a btree that supports overlapped
> > > > interval records and the deletion results in two btree blocks being
> > > > joined, we defer updating the high/low keys until after all possible
> > > > joining (at higher levels in the tree) have finished.  At this point,
> > > > the btree pointers at all levels have been updated to remove the empty
> > > > blocks and we can update the low and high keys.
> > > > 
> > > > When we're doing this, we must be careful to update the keys of all
> > > > node pointers up to the root instead of stopping at the first set of
> > > > keys that don't need updating.  This is because it's possible for a
> > > > single deletion to cause joining of multiple levels of tree, and so
> > > > we need to update everything going back to the root.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > > ---
> > > 
> > > I think I get the gist of this and it mostly looks Ok to me. A few
> > > questions and minor comments...
> > 
> > Ok.
> > 
> > > >  fs/xfs/libxfs/xfs_btree.c |  379 +++++++++++++++++++++++++++++++++++++++++----
> > > >  fs/xfs/libxfs/xfs_btree.h |   16 ++
> > > >  fs/xfs/xfs_trace.h        |   36 ++++
> > > >  3 files changed, 395 insertions(+), 36 deletions(-)
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> > > > index a096539..afcafd6 100644
> > > > --- a/fs/xfs/libxfs/xfs_btree.c
> > > > +++ b/fs/xfs/libxfs/xfs_btree.c
> ...
> > > > @@ -2149,7 +2392,9 @@ xfs_btree_lshift(
> > > >  		rkp = &key;
> > > >  	}
> > > >  
> > > > -	/* Update the parent key values of right. */
> > > > +	/* Update the parent key values of left and right. */
> > > > +	xfs_btree_sibling_updkeys(cur, level, XFS_BB_LEFTSIB, left, lbp);
> > > > +	xfs_btree_updkeys(cur, level);
> > > >  	error = xfs_btree_updkey(cur, rkp, level + 1);
> > > >  	if (error)
> > > >  		goto error0;
> > > > @@ -2321,6 +2566,9 @@ xfs_btree_rshift(
> > > >  	if (error)
> > > >  		goto error1;
> > > >  
> > > > +	/* Update left and right parent pointers */
> > > > +	xfs_btree_updkeys(cur, level);
> > > > +	xfs_btree_updkeys(tcur, level);
> > > 
> > > In this case, we grab the last record of the block, increment from there
> > > and update using the cursor. This is much more straightforward, imo.
> > > Could we use this approach in the left shift case as well?
> > 
> > Yes, I think so.  I might have started refactoring btree_sibling_updkeys
> > out of existence and got distracted, since there isn't anything that uses
> > the RIGHTSIB ptr value.
> > 
> 
> Ok, I think that would be much cleaner.

Done.

> > > >  	error = xfs_btree_updkey(tcur, rkp, level + 1);
> > > >  	if (error)
> > > >  		goto error1;
> > > > @@ -2356,7 +2604,7 @@ __xfs_btree_split(
> > > >  	struct xfs_btree_cur	*cur,
> > > >  	int			level,
> > > >  	union xfs_btree_ptr	*ptrp,
> > > > -	union xfs_btree_key	*key,
> > > > +	struct xfs_btree_double_key	*key,
> > > >  	struct xfs_btree_cur	**curp,
> > > >  	int			*stat)		/* success/failure */
> > > >  {
> > > > @@ -2452,9 +2700,6 @@ __xfs_btree_split(
> > > >  
> > > >  		xfs_btree_log_keys(cur, rbp, 1, rrecs);
> > > >  		xfs_btree_log_ptrs(cur, rbp, 1, rrecs);
> > > > -
> > > > -		/* Grab the keys to the entries moved to the right block */
> > > > -		xfs_btree_copy_keys(cur, key, rkp, 1);
> > > >  	} else {
> > > >  		/* It's a leaf.  Move records.  */
> > > >  		union xfs_btree_rec	*lrp;	/* left record pointer */
> > > > @@ -2465,12 +2710,8 @@ __xfs_btree_split(
> > > >  
> > > >  		xfs_btree_copy_recs(cur, rrp, lrp, rrecs);
> > > >  		xfs_btree_log_recs(cur, rbp, 1, rrecs);
> > > > -
> > > > -		cur->bc_ops->init_key_from_rec(key,
> > > > -			xfs_btree_rec_addr(cur, 1, right));
> > > >  	}
> > > >  
> > > > -
> > > >  	/*
> > > >  	 * Find the left block number by looking in the buffer.
> > > >  	 * Adjust numrecs, sibling pointers.
> > > > @@ -2484,6 +2725,12 @@ __xfs_btree_split(
> > > >  	xfs_btree_set_numrecs(left, lrecs);
> > > >  	xfs_btree_set_numrecs(right, xfs_btree_get_numrecs(right) + rrecs);
> > > >  
> > > > +	/* Find the low & high keys for the new block. */
> > > > +	if (level > 0)
> > > > +		xfs_btree_find_node_keys(cur, right, &key->low, &key->high);
> > > > +	else
> > > > +		xfs_btree_find_leaf_keys(cur, right, &key->low, &key->high);
> > > > +
> > > 
> > > Why not push these into the above if/else where the previous key
> > > copy/init calls were removed from?
> > 
> > We don't set bb_numrecs on the right block until the line above the new
> > hunk, and the btree_find_*_keys functions require numrecs to be set.
> > 
> > The removed key copy/init calls only looked at keys[1].
> > 
> > That said, it's trivial to move the set_numrecs calls above the if statement.
> > 
> 
> Ok, thanks. No need to shuffle it around. I'd suggest a one-liner
> comment though so somebody doesn't blindly refactor this down the road.
> It also sounds like the find keys functions could use ASSERT() checks
> for a sane bb_numrecs.

Hmm.  I already moved it, oh well.

It _does_ make the function less messy, so I'll leave it unless anyone yells.

> > > >  	xfs_btree_log_block(cur, rbp, XFS_BB_ALL_BITS);
> > > >  	xfs_btree_log_block(cur, lbp, XFS_BB_NUMRECS | XFS_BB_RIGHTSIB);
> > > >  
> ...
> > > > @@ -3095,8 +3365,24 @@ xfs_btree_insrec(
> > > >  	xfs_btree_log_block(cur, bp, XFS_BB_NUMRECS);
> > > >  
> > > >  	/* If we inserted at the start of a block, update the parents' keys. */
> > > 
> > > This comment is associated with the codeblock that has been pushed
> > > further down, no?
> > 
> > Correct.  I think that got mismerged somewhere along the way.
> > 
> > > > +	if (ncur && bp->b_bn != old_bn) {
> > > > +		/*
> > > > +		 * We just inserted into a new tree block, which means that
> > > > +		 * the key for the block is in nkey, not the tree.
> > > > +		 */
> > > > +		if (level == 0)
> > > > +			xfs_btree_find_leaf_keys(cur, block, &nkey.low,
> > > > +					&nkey.high);
> > > > +		else
> > > > +			xfs_btree_find_node_keys(cur, block, &nkey.low,
> > > > +					&nkey.high);
> > > > +	} else {
> > > > +		/* Updating the left block, do it the standard way. */
> > > > +		xfs_btree_updkeys(cur, level);
> > > > +	}
> > > > +
> > > 
> > > Not quite sure I follow the purpose of this hunk. Is this for the case
> > > where a btree split occurs, nkey is filled in for the new/right block
> > > and then (after nkey is filled in) the new record ends up being added to
> > > the new block? If so, what about the case where ncur is not created?
> > > (It looks like that's possible from the code, but I could easily be
> > > missing some context as to why that's not the case.)
> > 
> > Yes, the first part of the if-else hunk is to fill out nkey when we've
> > split a btree block.  Now that I look at it again, I think that whole
> > weird conditional could be replaced with the same xfs_btree_ptr_is_null()
> > check later on.  I think it can also be combined with it.

This is incorrect.  The only time we want to perform the nkey recalculation
is in the specific case where we split a btree block and the cursor ends up
pointing into the new block for the insertion.  We cannot gate the nkey
recalc on whether or not &nptr is null, because if we insert into the left
block after a split we'll recalculate nkey (the right block's keys) using
the left block's data, which is incorrect.  We probably want to do the key
recalc ahead of calling ->update_lastrec because the callback could modify
the cursor.

So I'll just leave the code mostly as is, clarify the comments about what
we're doing and why, and change the if statement to:

if (bp && bp->b_bn != old_bn)

Also for some reason I neglected to check the return code from
xfs_btree_updkeys, so I will go fix that.  At the moment it doesn't matter
because updkeys never fails, but I might as well fix it now.

> Ok.
> 
> > Commentage for now:
> > 
> > /*
> >  * If we just inserted a new tree block, we have to find the low
> >  * and high keys for the new block and arrange to pass them back
> >  * separately.  If we're just updating a block we can use the
> >  * regular tree update mechanism.
> >  */
> > 
> 
> Couldn't you just point out that nkey may not be coherent with the new
> block if the new record was inserted therein..?

Yes, that would be less convoluted.  Done. :)

> > > In any event, I think we could elaborate a bit in the comment on why
> > > this is necessary. I'd also move it above the top-level if/else.
> > > 
> > > >  	if (optr == 1) {
> > > > -		error = xfs_btree_updkey(cur, key, level + 1);
> > > > +		error = xfs_btree_updkey(cur, &key->low, level + 1);
> > > >  		if (error)
> > > >  			goto error0;
> > > >  	}
> > > > @@ -3147,7 +3433,7 @@ xfs_btree_insert(
> > > >  	union xfs_btree_ptr	nptr;	/* new block number (split result) */
> > > >  	struct xfs_btree_cur	*ncur;	/* new cursor (split result) */
> > > >  	struct xfs_btree_cur	*pcur;	/* previous level's cursor */
> > > > -	union xfs_btree_key	key;	/* key of block to insert */
> > > > +	struct xfs_btree_double_key	key;	/* key of block to insert */
> > > 
> > > Probably should fix up the function param alignment here and the couple
> > > other or so places we make this change.
> > 
> > I changed the name to xfs_btree_bigkey, which avoids the alignment problems.
> > 
> 
> Sounds good.

--D

> 
> Brian
> 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > >  
> > > >  	level = 0;
> > > >  	ncur = NULL;
> > > > @@ -3552,6 +3838,7 @@ xfs_btree_delrec(
> > > >  	 * If we deleted the leftmost entry in the block, update the
> > > >  	 * key values above us in the tree.
> > > >  	 */
> > > > +	xfs_btree_updkeys(cur, level);
> > > >  	if (ptr == 1) {
> > > >  		error = xfs_btree_updkey(cur, keyp, level + 1);
> > > >  		if (error)
> > > > @@ -3882,6 +4169,16 @@ xfs_btree_delrec(
> > > >  	if (level > 0)
> > > >  		cur->bc_ptrs[level]--;
> > > >  
> > > > +	/*
> > > > +	 * We combined blocks, so we have to update the parent keys if the
> > > > +	 * btree supports overlapped intervals.  However, bc_ptrs[level + 1]
> > > > +	 * points to the old block so that the caller knows which record to
> > > > +	 * delete.  Therefore, the caller must be savvy enough to call updkeys
> > > > +	 * for us if we return stat == 2.  The other exit points from this
> > > > +	 * function don't require deletions further up the tree, so they can
> > > > +	 * call updkeys directly.
> > > > +	 */
> > > > +
> > > >  	XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
> > > >  	/* Return value means the next level up has something to do. */
> > > >  	*stat = 2;
> > > > @@ -3907,6 +4204,7 @@ xfs_btree_delete(
> > > >  	int			error;	/* error return value */
> > > >  	int			level;
> > > >  	int			i;
> > > > +	bool			joined = false;
> > > >  
> > > >  	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> > > >  
> > > > @@ -3920,8 +4218,17 @@ xfs_btree_delete(
> > > >  		error = xfs_btree_delrec(cur, level, &i);
> > > >  		if (error)
> > > >  			goto error0;
> > > > +		if (i == 2)
> > > > +			joined = true;
> > > >  	}
> > > >  
> > > > +	/*
> > > > +	 * If we combined blocks as part of deleting the record, delrec won't
> > > > +	 * have updated the parent keys so we have to do that here.
> > > > +	 */
> > > > +	if (joined)
> > > > +		xfs_btree_updkeys_force(cur, 0);
> > > > +
> > > >  	if (i == 0) {
> > > >  		for (level = 1; level < cur->bc_nlevels; level++) {
> > > >  			if (cur->bc_ptrs[level] == 0) {
> > > > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> > > > index b99c018..a5ec6c7 100644
> > > > --- a/fs/xfs/libxfs/xfs_btree.h
> > > > +++ b/fs/xfs/libxfs/xfs_btree.h
> > > > @@ -126,6 +126,9 @@ struct xfs_btree_ops {
> > > >  	size_t	key_len;
> > > >  	size_t	rec_len;
> > > >  
> > > > +	/* flags */
> > > > +	uint	flags;
> > > > +
> > > >  	/* cursor operations */
> > > >  	struct xfs_btree_cur *(*dup_cursor)(struct xfs_btree_cur *);
> > > >  	void	(*update_cursor)(struct xfs_btree_cur *src,
> > > > @@ -162,11 +165,21 @@ struct xfs_btree_ops {
> > > >  				     union xfs_btree_rec *rec);
> > > >  	void	(*init_ptr_from_cur)(struct xfs_btree_cur *cur,
> > > >  				     union xfs_btree_ptr *ptr);
> > > > +	void	(*init_high_key_from_rec)(union xfs_btree_key *key,
> > > > +					  union xfs_btree_rec *rec);
> > > >  
> > > >  	/* difference between key value and cursor value */
> > > >  	__int64_t (*key_diff)(struct xfs_btree_cur *cur,
> > > >  			      union xfs_btree_key *key);
> > > >  
> > > > +	/*
> > > > +	 * Difference between key2 and key1 -- positive if key2 > key1,
> > > > +	 * negative if key2 < key1, and zero if equal.
> > > > +	 */
> > > > +	__int64_t (*diff_two_keys)(struct xfs_btree_cur *cur,
> > > > +				   union xfs_btree_key *key1,
> > > > +				   union xfs_btree_key *key2);
> > > > +
> > > >  	const struct xfs_buf_ops	*buf_ops;
> > > >  
> > > >  #if defined(DEBUG) || defined(XFS_WARN)
> > > > @@ -182,6 +195,9 @@ struct xfs_btree_ops {
> > > >  #endif
> > > >  };
> > > >  
> > > > +/* btree ops flags */
> > > > +#define XFS_BTREE_OPS_OVERLAPPING	(1<<0)	/* overlapping intervals */
> > > > +
> > > >  /*
> > > >   * Reasons for the update_lastrec method to be called.
> > > >   */
> > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > > index 68f27f7..ffea28c 100644
> > > > --- a/fs/xfs/xfs_trace.h
> > > > +++ b/fs/xfs/xfs_trace.h
> > > > @@ -38,6 +38,7 @@ struct xlog_recover_item;
> > > >  struct xfs_buf_log_format;
> > > >  struct xfs_inode_log_format;
> > > >  struct xfs_bmbt_irec;
> > > > +struct xfs_btree_cur;
> > > >  
> > > >  DECLARE_EVENT_CLASS(xfs_attr_list_class,
> > > >  	TP_PROTO(struct xfs_attr_list_context *ctx),
> > > > @@ -2183,6 +2184,41 @@ DEFINE_DISCARD_EVENT(xfs_discard_toosmall);
> > > >  DEFINE_DISCARD_EVENT(xfs_discard_exclude);
> > > >  DEFINE_DISCARD_EVENT(xfs_discard_busy);
> > > >  
> > > > +/* btree cursor events */
> > > > +DECLARE_EVENT_CLASS(xfs_btree_cur_class,
> > > > +	TP_PROTO(struct xfs_btree_cur *cur, int level, struct xfs_buf *bp),
> > > > +	TP_ARGS(cur, level, bp),
> > > > +	TP_STRUCT__entry(
> > > > +		__field(dev_t, dev)
> > > > +		__field(xfs_btnum_t, btnum)
> > > > +		__field(int, level)
> > > > +		__field(int, nlevels)
> > > > +		__field(int, ptr)
> > > > +		__field(xfs_daddr_t, daddr)
> > > > +	),
> > > > +	TP_fast_assign(
> > > > +		__entry->dev = cur->bc_mp->m_super->s_dev;
> > > > +		__entry->btnum = cur->bc_btnum;
> > > > +		__entry->level = level;
> > > > +		__entry->nlevels = cur->bc_nlevels;
> > > > +		__entry->ptr = cur->bc_ptrs[level];
> > > > +		__entry->daddr = bp->b_bn;
> > > > +	),
> > > > +	TP_printk("dev %d:%d btnum %d level %d/%d ptr %d daddr 0x%llx",
> > > > +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > > > +		  __entry->btnum,
> > > > +		  __entry->level,
> > > > +		  __entry->nlevels,
> > > > +		  __entry->ptr,
> > > > +		  (unsigned long long)__entry->daddr)
> > > > +)
> > > > +
> > > > +#define DEFINE_BTREE_CUR_EVENT(name) \
> > > > +DEFINE_EVENT(xfs_btree_cur_class, name, \
> > > > +	TP_PROTO(struct xfs_btree_cur *cur, int level, struct xfs_buf *bp), \
> > > > +	TP_ARGS(cur, level, bp))
> > > > +DEFINE_BTREE_CUR_EVENT(xfs_btree_updkeys);
> > > > +
> > > >  #endif /* _TRACE_XFS_H */
> > > >  
> > > >  #undef TRACE_INCLUDE_PATH
> > > > 
> > > > _______________________________________________
> > > > xfs mailing list
> > > > xfs@xxxxxxxxxxx
> > > > http://oss.sgi.com/mailman/listinfo/xfs
> > 
> > _______________________________________________
> > xfs mailing list
> > xfs@xxxxxxxxxxx
> > http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs



[Index of Archives]     [Linux XFS Devel]     [Linux Filesystem Development]     [Filesystem Testing]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux