On Thu, Sep 26, 2024 at 10:31:33AM -0400, Josef Bacik wrote:
> On Wed, Sep 25, 2024 at 02:46:25PM -0700, Boris Burkov wrote:
> > On Wed, Sep 25, 2024 at 04:17:34PM -0400, Josef Bacik wrote:
> > > On Wed, Sep 25, 2024 at 12:07:43PM -0700, Boris Burkov wrote:
> > > > On Tue, Sep 24, 2024 at 05:11:37PM -0400, Josef Bacik wrote:
> > > > > Since the inception of relocation we have maintained the backref cache
> > > > > across transaction commits, updating the backref cache with the new
> > > > > bytenr whenever we COW'ed blocks that were in the cache, and then
> > > > > updating their bytenr once we detected a transaction id change.
> > > > > 
> > > > > This works as long as we're only ever modifying blocks, not changing
> > > > > the structure of the tree.
> > > > > 
> > > > > However, relocation does in fact change the structure of the tree. For
> > > > > example, if we are relocating a data extent, we will look up all the
> > > > > leaves that point to this data extent. We will then call
> > > > > do_relocation() on each of these leaves, which will COW down to the
> > > > > leaf and then update the file extent location.
> > > > > 
> > > > > But a key feature of do_relocation() is the pending list. This is all
> > > > > the pending nodes that we modified when we updated the file extent
> > > > > item. We will then process all of these blocks via
> > > > > finish_pending_nodes(), which calls do_relocation() on all of the
> > > > > nodes that led up to that leaf.
> > > > > 
> > > > > The purpose of this is to make sure we don't break sharing unless we
> > > > > absolutely have to. Consider the case where we have 3 snapshots that
> > > > > all point to this leaf through the same nodes; the initial COW would
> > > > > have created a whole new path. If we did this for all 3 snapshots we
> > > > > would end up with 3x the number of nodes we had originally. To avoid
> > > > > this we will cycle through each of the snapshots that point to each of
> > > > > these nodes and update their pointers to point at the new nodes.
> > > > > 
> > > > > Once we update the pointer to the new node we will drop the node we
> > > > > removed the link for, and all of its children, via
> > > > > btrfs_drop_subtree(). This is essentially just btrfs_drop_snapshot(),
> > > > > but for an arbitrary point in the snapshot.
> > > > > 
> > > > > The problem with this is that we will never reflect this in the
> > > > > backref cache. If we do this btrfs_drop_snapshot() for a node that is
> > > > > in the backref tree, we will leave the node in the backref tree. This
> > > > > becomes a problem when we change the transid, as now the backref cache
> > > > > has entire subtrees that no longer exist, but that still appear to be
> > > > > pointed to by the same roots.
> > > > > 
> > > > > In the best case scenario you end up with "adding refs to an existing
> > > > > tree ref" errors from insert_inline_extent_backref(), where we attempt
> > > > > to link in nodes on roots that are no longer valid.
> > > > > 
> > > > > In the worst case you will double-free some random block and re-use it
> > > > > while there are still references to the block.
> > > > > 
> > > > > This is extremely subtle, and the consequences are quite bad. There
> > > > > isn't a way to make sure our backref cache is consistent between
> > > > > transids.
> > > > > 
> > > > > In order to fix this we need to simply evict the entire backref cache
> > > > > any time we cross transids.
> > > > > This reduces performance in that we have to
> > > > > rebuild this backref cache every time we change transids, but it
> > > > > fixes the bug.
> > > > > 
> > > > > This has existed since relocation was added, and is a pretty critical
> > > > > bug. There's a lot more cleanup that can be done now that this
> > > > > functionality is going away, but this patch is as small as possible in
> > > > > order to fix the problem and make it easy for us to backport it to all
> > > > > the kernels it needs to be backported to.
> > > > > 
> > > > > A followup series will dismantle more of this code and simplify
> > > > > relocation drastically to remove this functionality.
> > > > 
> > > > +1
> > > > 
> > > > > We have a reproducer that reproduced the corruption within a few
> > > > > minutes of running. With this patch it survives several
> > > > > iterations/hours of running the reproducer.
> > > > > 
> > > > > Fixes: 3fd0a5585eb9 ("Btrfs: Metadata ENOSPC handling for balance")
> > > > > Cc: stable@xxxxxxxxxxxxxxx
> > > > > Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx>

Reviewed-by: Boris Burkov <boris@xxxxxx>

> > > > > ---
> > > > >  fs/btrfs/backref.c    | 12 ++++---
> > > > >  fs/btrfs/relocation.c | 76 +++----
> > > > >  2 files changed, 13 insertions(+), 75 deletions(-)
> > > > > 
> > > > > diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> > > > > index e2f478ecd7fd..f8e1d5b2c512 100644
> > > > > --- a/fs/btrfs/backref.c
> > > > > +++ b/fs/btrfs/backref.c
> > > > > @@ -3179,10 +3179,14 @@ void btrfs_backref_release_cache(struct btrfs_backref_cache *cache)
> > > > >  		btrfs_backref_cleanup_node(cache, node);
> > > > >  	}
> > > > >  
> > > > > -	cache->last_trans = 0;
> > > > > -
> > > > > -	for (i = 0; i < BTRFS_MAX_LEVEL; i++)
> > > > > -		ASSERT(list_empty(&cache->pending[i]));
> > > > > +	for (i = 0; i < BTRFS_MAX_LEVEL; i++) {
> > > > > +		while (!list_empty(&cache->pending[i])) {
> > > > > +			node = list_first_entry(&cache->pending[i],
> > > > > +						struct btrfs_backref_node,
> > > > > +						list);
> > > > > +			btrfs_backref_cleanup_node(cache, node);
> > > > > +		}
> > > > > +	}
> > > > 
> > > > Non-blocking suggestion:
> > > > 
> > > > The fact that this cleanup has to keep an accurate view of the leaves
> > > > while it runs feels like overkill. If we just maintained a linked list
> > > > of all the nodes/edges we could call kfree on all of them. I think the
> > > > existing rbtree of nodes would probably just work too?
> > > > 
> > > > I think just handling the pending list instead of assuming it's empty
> > > > makes sense, though.
> > > 
> > > This is one of the cleanups that's coming, because we keep a list of
> > > everything in either ->leaves or ->detached, and now we can have them in
> > > the pending list. The cleanup function assumes that everything is
> > > handled at this point, so everything should be empty.
> > > 
> > > This is obviously silly, so the cleanup is to iterate the rbtree and
> > > free that way and not have all this stuff.
> > > 
> > > We were only calling this after we were completely done with the loop
> > > for the backref cache, so the pending lists should have been empty, or,
> > > if there was an error, cleaned up before calling this function.
> > > 
> > > But now, because we're calling it every time, we have to make sure we
> > > evict the pending list ourselves.
> > > 
> > > Again, this was done for simplicity's sake; the problem is bad enough
> > > that we want it to be backported as widely as possible, so all the
> > > other cleanup stuff is going to be done separately.
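
Just to make sure we're picturing the same teardown, something like the
sketch below is what I have in mind (hypothetical helper name, and
obviously not the actual followup patch). It assumes, as the current code
does, that btrfs_backref_cleanup_node() drops the node's edges and removes
the node from the rbtree:

	static void backref_cache_free_all(struct btrfs_backref_cache *cache)
	{
		struct rb_node *rb_node;
		struct btrfs_backref_node *node;

		/*
		 * Every node is reachable from cache->rb_root no matter
		 * which of ->leaves/->detached/->pending it is (or isn't)
		 * on, so walking the rbtree should free everything without
		 * the per-list bookkeeping.
		 */
		while ((rb_node = rb_first(&cache->rb_root))) {
			node = rb_entry(rb_node,
					struct btrfs_backref_node, rb_node);
			btrfs_backref_cleanup_node(cache, node);
		}
	}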

> > > > >  
> > > > >  	ASSERT(list_empty(&cache->pending_edge));
> > > > >  	ASSERT(list_empty(&cache->useless_node));
> > > > >  	ASSERT(list_empty(&cache->changed));
> > > > > diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> > > > > index ea4ed85919ec..aaa9cac213f1 100644
> > > > > --- a/fs/btrfs/relocation.c
> > > > > +++ b/fs/btrfs/relocation.c
> > > > > @@ -232,70 +232,6 @@ static struct btrfs_backref_node *walk_down_backref(
> > > > >  	return NULL;
> > > > >  }
> > > > >  
> > > > > -static void update_backref_node(struct btrfs_backref_cache *cache,
> > > > > -				struct btrfs_backref_node *node, u64 bytenr)
> > > > > -{
> > > > > -	struct rb_node *rb_node;
> > > > > -
> > > > > -	rb_erase(&node->rb_node, &cache->rb_root);
> > > > > -	node->bytenr = bytenr;
> > > > > -	rb_node = rb_simple_insert(&cache->rb_root, node->bytenr, &node->rb_node);
> > > > > -	if (rb_node)
> > > > > -		btrfs_backref_panic(cache->fs_info, bytenr, -EEXIST);
> > > > > -}
> > > > > -
> > > > > -/*
> > > > > - * update backref cache after a transaction commit
> > > > > - */
> > > > > -static int update_backref_cache(struct btrfs_trans_handle *trans,
> > > > > -				struct btrfs_backref_cache *cache)
> > > > > -{
> > > > > -	struct btrfs_backref_node *node;
> > > > > -	int level = 0;
> > > > > -
> > > > > -	if (cache->last_trans == 0) {
> > > > > -		cache->last_trans = trans->transid;
> > > > > -		return 0;
> > > > > -	}
> > > > > -
> > > > > -	if (cache->last_trans == trans->transid)
> > > > > -		return 0;
> > > > > -
> > > > > -	/*
> > > > > -	 * detached nodes are used to avoid unnecessary backref
> > > > > -	 * lookup. transaction commit changes the extent tree.
> > > > > -	 * so the detached nodes are no longer useful.
> > > > > -	 */
> > > > > -	while (!list_empty(&cache->detached)) {
> > > > > -		node = list_entry(cache->detached.next,
> > > > > -				  struct btrfs_backref_node, list);
> > > > > -		btrfs_backref_cleanup_node(cache, node);
> > > > > -	}
> > > > > -
> > > > > -	while (!list_empty(&cache->changed)) {
> > > > > -		node = list_entry(cache->changed.next,
> > > > > -				  struct btrfs_backref_node, list);
> > > > > -		list_del_init(&node->list);
> > > > > -		BUG_ON(node->pending);
> > > > > -		update_backref_node(cache, node, node->new_bytenr);
> > > > > -	}
> > > > > -
> > > > > -	/*
> > > > > -	 * some nodes can be left in the pending list if there were
> > > > > -	 * errors during processing the pending nodes.
> > > > > -	 */
> > > > > -	for (level = 0; level < BTRFS_MAX_LEVEL; level++) {
> > > > > -		list_for_each_entry(node, &cache->pending[level], list) {
> > > > > -			BUG_ON(!node->pending);
> > > > > -			if (node->bytenr == node->new_bytenr)
> > > > > -				continue;
> > > > > -			update_backref_node(cache, node, node->new_bytenr);
> > > > > -		}
> > > > > -	}
> > > > > -
> > > > > -	cache->last_trans = 0;
> > > > > -	return 1;
> > > > > -}
> > > > > -
> > > > >  static bool reloc_root_is_dead(const struct btrfs_root *root)
> > > > >  {
> > > > >  	/*
> > > > > @@ -551,9 +487,6 @@ static int clone_backref_node(struct btrfs_trans_handle *trans,
> > > > >  	struct btrfs_backref_edge *new_edge;
> > > > >  	struct rb_node *rb_node;
> > > > >  
> > > > > -	if (cache->last_trans > 0)
> > > > > -		update_backref_cache(trans, cache);
> > > > > -
> > > > 
> > > > This looks suspicious to me. You said in the commit message that we
> > > > need to nuke the cache any time we cross a transid. However, here, you
> > > > detect a changed transid s.t. we would be calling update (presumably
> > > > for some reason someone felt was important) but now we are just
> > > > skipping it.
> > > > 
> > > > Why is this correct/safe? Do we need to dump the cache here too? Are
> > > > we never going to hit this because of some implicit synchronization
> > > > between this post-snapshot path and the cache-blowing-up you added to
> > > > relocate_block_group?
> > > > 
> > > > And if we need to dump the cache here to be safe, are you sure the
> > > > other places we can go into the relocation code from outside
> > > > relocate_block_group are safe to interact with an old cache too? I'm
> > > > thinking of btrfs_reloc_cow_block primarily.
> > > 
> > > This is an extremely subtle part. clone_backref_node() is called when
> > > we create the reloc root, which can happen out of band of the
> > > relocation, so outside of the relocate_block_group loop.
> > > 
> > > This was made to work in conjunction with the updating logic, so we had
> > > to call it here in order to make sure the code in clone found the right
> > > nodes, because src->commit_root->start would have been in ->new_bytenr
> > > if we hadn't updated the backref cache yet.
> > > 
> > > Now we don't care: if it happens out of band we actually don't want to
> > > keep those cloned backref cache nodes, so the relocate block group loop
> > > will evict the cache and rebuild it anyway.
> > > 
> > > If this happens where it should, e.g. select_reloc_root, then the cache
> > > will be up to date and src->commit_root->start will match what exists
> > > in the backref cache.
> > > 
> > > This is safe specifically because in the relocate block group loop we
> > > will evict everything, and then any reloc root creation that happens
> > > once we have the backref cache in place will be synchronized
> > > appropriately.
> > 
> > I thought about it some more, and noticed that clone_backref_node can
> > only be called in the transaction critical section in
> > create_pending_snapshot, while btrfs_relocate_block_group can only run
> > under a transaction, so if clone_backref_node runs, relocation is
> > guaranteed to see a changed transid by the time it runs. Since the
> > effects of clone_backref_node are contained entirely within the backref
> > cache, which we are guaranteed is going to be blown away at the next
> > relocation step, I think we can just omit clone_backref_node entirely?
> 
> Yeah, it's getting deleted in followup patches.
> 
> > 
> > Whether we want to get rid of it entirely or not, I am relatively
> > convinced now that this aspect of the patch is fine.
> > 
> > On the other hand, I am now extremely concerned about the interaction
> > between the new relocate_block_group and btrfs_reloc_cow_block...
> > 
> > > > > 
> > > > >  	rb_node = rb_simple_search(&cache->rb_root, src->commit_root->start);
> > > > >  	if (rb_node) {
> > > > >  		node = rb_entry(rb_node, struct btrfs_backref_node, rb_node);
> > > > > @@ -3698,10 +3631,11 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
> > > > >  			break;
> > > > >  		}
> > > > >  restart:
> > > > > -		if (update_backref_cache(trans, &rc->backref_cache)) {
> > > > > -			btrfs_end_transaction(trans);
> > > > > -			trans = NULL;
> > > > > -			continue;
> > > > > +		if (rc->backref_cache.last_trans == 0) {
> > > > > +			rc->backref_cache.last_trans = trans->transid;
> > > > > +		} else if (rc->backref_cache.last_trans != trans->transid) {
> > > > > +			btrfs_backref_release_cache(&rc->backref_cache);
> > 
> > What prevents a race between this and btrfs_reloc_cow_block running
> > 
> >     node = rc->backref_cache.path[level];
> > 
> > I feel like I must be missing some higher-level synchronization between
> > the two, but if my suspicion is right, btrfs_reloc_cow_block would now
> > grab a backref_cache node and use it while btrfs_backref_release_cache
> > called kfree on it, probably resulting in a UAF.
> 
> This is the bit you're missing:
> 
>     if (btrfs_root_id(root) == BTRFS_TREE_RELOC_OBJECTID &&
>         rc->create_reloc_tree) {
> 
> We only do this if we're the reloc root, and we only search down the
> reloc root from relocation, so we know if we got here and we're the
> reloc root that everything is set up properly.

Ack, that makes sense.
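
To spell out the ordering for anyone following along, here is a heavily
condensed paraphrase of the btrfs_reloc_cow_block() path we were
discussing. Only the two quoted lines above are verbatim; the rest,
including the assumption that level is the block's tree level, is
paraphrase rather than the actual kernel source:

	/* Condensed paraphrase, not the real function body. */
	if (btrfs_root_id(root) == BTRFS_TREE_RELOC_OBJECTID &&
	    rc->create_reloc_tree) {
		/*
		 * Only reached when COWing a reloc root, and reloc roots
		 * are only searched from inside the relocate_block_group()
		 * loop, after the cache has been (re)built for the current
		 * transaction, so the cached node cannot have been freed
		 * out from under us.
		 */
		node = rc->backref_cache.path[level];
		/* ... use the cached node ... */
	}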

> > > > > +			rc->backref_cache.last_trans = trans->transid;
> > > > 
> > > > Non-blocking suggestion:
> > > > This feels like it could be simpler if we initialized last_trans to 0
> > > > after allocating the reloc_control; then we could just check if we
> > > > need to release, then unconditionally update.
> > > 
> > > Do you mean this?
> > > 
> > >     if (rc->backref_cache.last_trans != trans->transid) {
> > >     	if (rc->backref_cache.last_trans)
> > >     		btrfs_backref_release_cache(&rc->backref_cache);
> > >     	rc->backref_cache.last_trans = trans->transid;
> > >     }
> > 
> > Yeah, or even
> > 
> >     if (rc->backref_cache.last_trans != trans->transid)
> >     	btrfs_backref_release_cache(&rc->backref_cache);
> >     // no-op if they are equal
> >     rc->backref_cache.last_trans = trans->transid;
> 
> Ah yeah duh, thanks,
> 
> Josef
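
For the record, folding that simplification back into the hunk above, the
restart: check would presumably end up reading something like this; a
sketch of the shape discussed, not necessarily the committed version:

	restart:
		/*
		 * Evict the cache whenever we cross a transaction boundary.
		 * Setting last_trans is a no-op when the transids already
		 * match, and on the first pass (last_trans == 0) this just
		 * releases an empty cache, which is harmless.
		 */
		if (rc->backref_cache.last_trans != trans->transid)
			btrfs_backref_release_cache(&rc->backref_cache);
		rc->backref_cache.last_trans = trans->transid;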