On Tue, Sep 24, 2024 at 05:11:37PM -0400, Josef Bacik wrote:
> Since the inception of relocation we have maintained the backref cache
> across transaction commits, updating the backref cache with the new
> bytenr whenever we COW'ed blocks that were in the cache, and then
> updating their bytenr once we detected a transaction id change.
>
> This works as long as we're only ever modifying blocks, not changing
> the structure of the tree.
>
> However relocation does in fact change the structure of the tree. For
> example, if we are relocating a data extent, we will look up all the
> leaves that point to this data extent. We will then call
> do_relocation() on each of these leaves, which will COW down to the
> leaf and then update the file extent location.
>
> But a key feature of do_relocation() is the pending list. This holds
> all the pending nodes that we modified when we updated the file
> extent item. We will then process all of these blocks via
> finish_pending_nodes(), which calls do_relocation() on all of the
> nodes that led up to that leaf.
>
> The purpose of this is to make sure we don't break sharing unless we
> absolutely have to. Consider the case that we have 3 snapshots that
> all point to this leaf through the same nodes; the initial COW would
> have created a whole new path. If we did this for all 3 snapshots we
> would end up with 3x the number of nodes we had originally. To avoid
> this we will cycle through each of the snapshots that point to each
> of these nodes and update their pointers to point at the new nodes.
>
> Once we update the pointer to the new node we will drop the node we
> removed the link for, and all of its children, via
> btrfs_drop_subtree(). This is essentially just btrfs_drop_snapshot(),
> but for an arbitrary point in the snapshot.
>
> The problem is that we never reflect this in the backref cache. If we
> do this btrfs_drop_snapshot() for a node that is in the backref tree,
> we will leave the node in the backref tree. This becomes a problem
> when we change the transid, as now the backref cache has entire
> subtrees that no longer exist, but which the cache still shows as
> pointed to by the same roots.
>
> In the best case scenario you end up with "adding refs to an existing
> tree ref" errors from insert_inline_extent_backref(), where we
> attempt to link in nodes on roots that are no longer valid.
>
> Worst case, you will double free some random block and re-use it
> while there are still references to it.
>
> This is extremely subtle, and the consequences are quite bad. There
> isn't a way to make sure our backref cache is consistent between
> transids.
>
> In order to fix this we need to simply evict the entire backref cache
> any time we cross transids. This reduces performance in that we have
> to rebuild the backref cache every time we change transids, but it
> fixes the bug.
>
> This has existed since relocation was added, and is a pretty critical
> bug. There's a lot more cleanup that can be done now that this
> functionality is going away, but this patch is as small as possible
> in order to fix the problem and make it easy for us to backport it to
> all the kernels it needs to be backported to.
>
> Followup series will dismantle more of this code and simplify
> relocation drastically to remove this functionality.

+1

> We have a reproducer that reproduces the corruption within a few
> minutes of running. With this patch it survives several
> iterations/hours of running the reproducer.
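To make sure I'm reading the new lifetime rule correctly: the cache
must never survive a transid boundary, so the consumer that notices a
new transid tears the whole thing down and lets it repopulate lazily.
Condensed, that is roughly (my summary of the patch below, not the
literal code):

	/* My reading of the new rule; not the literal patch. */
	if (cache->last_trans && cache->last_trans != trans->transid)
		btrfs_backref_release_cache(cache);
	cache->last_trans = trans->transid;

That trade (a full rebuild per transaction vs. no cross-transid repair
at all) seems clearly right given how subtle the repair turned out to
be.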
>
> Fixes: 3fd0a5585eb9 ("Btrfs: Metadata ENOSPC handling for balance")
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
> ---
>  fs/btrfs/backref.c    | 12 ++++---
>  fs/btrfs/relocation.c | 76 +++----------------------------------------
>  2 files changed, 13 insertions(+), 75 deletions(-)
>
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index e2f478ecd7fd..f8e1d5b2c512 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -3179,10 +3179,14 @@ void btrfs_backref_release_cache(struct btrfs_backref_cache *cache)
>  		btrfs_backref_cleanup_node(cache, node);
>  	}
>  
> -	cache->last_trans = 0;
> -
> -	for (i = 0; i < BTRFS_MAX_LEVEL; i++)
> -		ASSERT(list_empty(&cache->pending[i]));
> +	for (i = 0; i < BTRFS_MAX_LEVEL; i++) {
> +		while (!list_empty(&cache->pending[i])) {
> +			node = list_first_entry(&cache->pending[i],
> +						struct btrfs_backref_node,
> +						list);
> +			btrfs_backref_cleanup_node(cache, node);
> +		}
> +	}

Non-blocking suggestion: the fact that this cleanup has to keep the
cache's bookkeeping consistent while it tears everything down feels
like overkill. If we just maintained a linked list of all the
nodes/edges, we could call kfree() on all of them. I think walking the
existing rbtree of nodes would probably work too? Either way, draining
the pending lists here instead of asserting they're empty makes sense
to me.
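Concretely, something like the following is what I'm picturing
(untested sketch; it leans on btrfs_backref_cleanup_node() unlinking
the node from rb_root via btrfs_backref_drop_node(), and assumes every
remaining node is still reachable from the rbtree, which is the part
I'd want to double check):

	struct btrfs_backref_node *node;
	struct rb_node *rb;

	/* Pop nodes off the rbtree until nothing is left. */
	while ((rb = rb_first(&cache->rb_root))) {
		node = rb_entry(rb, struct btrfs_backref_node, rb_node);
		btrfs_backref_cleanup_node(cache, node);
	}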
>  	ASSERT(list_empty(&cache->pending_edge));
>  	ASSERT(list_empty(&cache->useless_node));
>  	ASSERT(list_empty(&cache->changed));
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index ea4ed85919ec..aaa9cac213f1 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -232,70 +232,6 @@ static struct btrfs_backref_node *walk_down_backref(
>  	return NULL;
>  }
>  
> -static void update_backref_node(struct btrfs_backref_cache *cache,
> -				struct btrfs_backref_node *node, u64 bytenr)
> -{
> -	struct rb_node *rb_node;
> -
> -	rb_erase(&node->rb_node, &cache->rb_root);
> -	node->bytenr = bytenr;
> -	rb_node = rb_simple_insert(&cache->rb_root, node->bytenr, &node->rb_node);
> -	if (rb_node)
> -		btrfs_backref_panic(cache->fs_info, bytenr, -EEXIST);
> -}
> -
> -/*
> - * update backref cache after a transaction commit
> - */
> -static int update_backref_cache(struct btrfs_trans_handle *trans,
> -				struct btrfs_backref_cache *cache)
> -{
> -	struct btrfs_backref_node *node;
> -	int level = 0;
> -
> -	if (cache->last_trans == 0) {
> -		cache->last_trans = trans->transid;
> -		return 0;
> -	}
> -
> -	if (cache->last_trans == trans->transid)
> -		return 0;
> -
> -	/*
> -	 * detached nodes are used to avoid unnecessary backref
> -	 * lookup. transaction commit changes the extent tree.
> -	 * so the detached nodes are no longer useful.
> -	 */
> -	while (!list_empty(&cache->detached)) {
> -		node = list_entry(cache->detached.next,
> -				  struct btrfs_backref_node, list);
> -		btrfs_backref_cleanup_node(cache, node);
> -	}
> -
> -	while (!list_empty(&cache->changed)) {
> -		node = list_entry(cache->changed.next,
> -				  struct btrfs_backref_node, list);
> -		list_del_init(&node->list);
> -		BUG_ON(node->pending);
> -		update_backref_node(cache, node, node->new_bytenr);
> -	}
> -
> -	/*
> -	 * some nodes can be left in the pending list if there were
> -	 * errors during processing the pending nodes.
> -	 */
> -	for (level = 0; level < BTRFS_MAX_LEVEL; level++) {
> -		list_for_each_entry(node, &cache->pending[level], list) {
> -			BUG_ON(!node->pending);
> -			if (node->bytenr == node->new_bytenr)
> -				continue;
> -			update_backref_node(cache, node, node->new_bytenr);
> -		}
> -	}
> -
> -	cache->last_trans = 0;
> -	return 1;
> -}
> -
>  static bool reloc_root_is_dead(const struct btrfs_root *root)
>  {
>  	/*
> @@ -551,9 +487,6 @@ static int clone_backref_node(struct btrfs_trans_handle *trans,
>  	struct btrfs_backref_edge *new_edge;
>  	struct rb_node *rb_node;
>  
> -	if (cache->last_trans > 0)
> -		update_backref_cache(trans, cache);
> -

This looks suspicious to me. You said in the commit message that we
need to nuke the cache any time we cross a transid. Here, however, the
removed code detected exactly the situation where we would previously
have called update_backref_cache() (presumably for a reason someone
felt was important), and now we just skip it. Why is this
correct/safe? Do we need to dump the cache here too? Or are we
guaranteed never to hit this with a stale cache because of some
implicit synchronization between this post-snapshot path and the cache
eviction you added to relocate_block_group()?

And if we do need to dump the cache here to be safe, are you sure the
other places where we enter the relocation code from outside
relocate_block_group() are safe to interact with an old cache too? I'm
thinking of btrfs_reloc_cow_block() primarily.

>  	rb_node = rb_simple_search(&cache->rb_root, src->commit_root->start);
>  	if (rb_node) {
>  		node = rb_entry(rb_node, struct btrfs_backref_node, rb_node);
> @@ -3698,10 +3631,11 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
>  			break;
>  		}
>  restart:
> -		if (update_backref_cache(trans, &rc->backref_cache)) {
> -			btrfs_end_transaction(trans);
> -			trans = NULL;
> -			continue;
> +		if (rc->backref_cache.last_trans == 0) {
> +			rc->backref_cache.last_trans = trans->transid;
> +		} else if (rc->backref_cache.last_trans != trans->transid) {
> +			btrfs_backref_release_cache(&rc->backref_cache);
> +			rc->backref_cache.last_trans = trans->transid;

Non-blocking suggestion: this feels like it could be simpler. If we
relied on last_trans being 0 right after the reloc_control is
allocated, we could just check whether we need to release, then
unconditionally update last_trans (rough sketch at the end of this
mail).

>  		}
>  
>  		ret = find_next_extent(rc, path, &key);
> -- 
> 2.43.0
> 
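To spell out that last suggestion, this is roughly what I had in mind
(untested sketch; it relies on the reloc_control being zero-initialized
at allocation so that last_trans starts at 0, and on
btrfs_backref_release_cache() being a harmless no-op for an empty
cache):

	/*
	 * Sketch only: on the first pass last_trans is 0, so we
	 * "release" an empty cache, which should be a no-op, and then
	 * stamp the current transid unconditionally.
	 */
	if (rc->backref_cache.last_trans != trans->transid)
		btrfs_backref_release_cache(&rc->backref_cache);
	rc->backref_cache.last_trans = trans->transid;

That drops the three-way branch and leaves a single invariant: past
this point the cache always belongs to the current transaction.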