Re: [PATCH] btrfs: drop the backref cache during relocation if we commit

Josef Bacik <josef@xxxxxxxxxxxxxx> · Thu, 26 Sep 2024 10:31:33 -0400

On Wed, Sep 25, 2024 at 02:46:25PM -0700, Boris Burkov wrote:
> On Wed, Sep 25, 2024 at 04:17:34PM -0400, Josef Bacik wrote:
> > On Wed, Sep 25, 2024 at 12:07:43PM -0700, Boris Burkov wrote:
> > > On Tue, Sep 24, 2024 at 05:11:37PM -0400, Josef Bacik wrote:
> > > > Since the inception of relocation we have maintained the backref cache
> > > > across transaction commits, updating the backref cache with the new
> > > > bytenr whenever we COW'ed blocks that were in the cache, and then
> > > > updating their bytenr once we detected a transaction id change.
> > > > 
> > > > This works as long as we're only ever modifying blocks, not changing the
> > > > structure of the tree.
> > > > 
> > > > However relocation does in fact change the structure of the tree.  For
> > > > example, if we are relocating a data extent, we will look up all the
> > > > leaves that point to this data extent.  We will then call
> > > > do_relocation() on each of these leaves, which will COW down to the leaf
> > > > and then update the file extent location.
> > > > 
> > > > But, a key feature of do_relocation is the pending list.  This is all
> > > > the pending nodes that we modified when we updated the file extent item.
> > > > We will then process all of these blocks via finish_pending_nodes, which
> > > > calls do_relocation() on all of the nodes that led up to that leaf.
> > > > 
> > > > The purpose of this is to make sure we don't break sharing unless we
> > > > absolutely have to.  Consider the case that we have 3 snapshots that all
> > > > point to this leaf through the same nodes, the initial COW would have
> > > > created a whole new path.  If we did this for all 3 snapshots we would
> > > > end up with 3x the number of nodes we had originally.  To avoid this we
> > > > will cycle through each of the snapshots that point to each of these
> > > > nodes and update their pointers to point at the new nodes.
> > > > 
> > > > Once we update the pointer to the new node we will drop the node we
> > > > removed the link for and all of its children via btrfs_drop_subtree().
> > > > This is essentially just btrfs_drop_snapshot(), but for an arbitrary
> > > > point in the snapshot.
> > > > 
> > > > The problem with this is that we will never reflect this in the backref
> > > > cache.  If we do this btrfs_drop_snapshot() for a node that is in the
> > > > backref tree, we will leave the node in the backref tree.  This becomes
> > > > a problem when we change the transid, as now the backref cache has
> > > > entire subtrees that no longer exist, but exist as if they still are
> > > > pointed to by the same roots.
> > > > 
> > > > In the best case scenario you end up with "adding refs to an existing
> > > > tree ref" errors from insert_inline_extent_backref(), where we attempt
> > > > to link in nodes on roots that are no longer valid.
> > > > 
> > > > Worst case you will double free some random block and re-use it when
> > > > there's still references to the block.
> > > > 
> > > > This is extremely subtle, and the consequences are quite bad.  There
> > > > isn't a way to make sure our backref cache is consistent between
> > > > transid's.
> > > > 
> > > > In order to fix this we need to simply evict the entire backref cache
> > > > anytime we cross transid's.  This reduces performance in that we have to
> > > > rebuild this backref cache every time we change transid's, but fixes the
> > > > bug.
> > > > 
> > > > This has existed since relocation was added, and is a pretty critical
> > > > bug.  There's a lot more cleanup that can be done now that this
> > > > functionality is going away, but this patch is as small as possible in
> > > > order to fix the problem and make it easy for us to backport it to all
> > > > the kernels it needs to be backported to.
> > > > 
> > > > Followup series will dismantle more of this code and simplify relocation
> > > > drastically to remove this functionality.
> > > 
> > > +1
> > > 
> > > > 
> > > > We have a reproducer that reproduced the corruption within a few minutes
> > > > of running.  With this patch it survives several iterations/hours of
> > > > running the reproducer.
> > > > 
> > > > Fixes: 3fd0a5585eb9 ("Btrfs: Metadata ENOSPC handling for balance")
> > > > Cc: stable@xxxxxxxxxxxxxxx
> > > > Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
> > > > ---
> > > >  fs/btrfs/backref.c    | 12 ++++---
> > > >  fs/btrfs/relocation.c | 76 +++----------------------------------------
> > > >  2 files changed, 13 insertions(+), 75 deletions(-)
> > > > 
> > > > diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> > > > index e2f478ecd7fd..f8e1d5b2c512 100644
> > > > --- a/fs/btrfs/backref.c
> > > > +++ b/fs/btrfs/backref.c
> > > > @@ -3179,10 +3179,14 @@ void btrfs_backref_release_cache(struct btrfs_backref_cache *cache)
> > > >  		btrfs_backref_cleanup_node(cache, node);
> > > >  	}
> > > >  
> > > > -	cache->last_trans = 0;
> > > > -
> > > > -	for (i = 0; i < BTRFS_MAX_LEVEL; i++)
> > > > -		ASSERT(list_empty(&cache->pending[i]));
> > > > +	for (i = 0; i < BTRFS_MAX_LEVEL; i++) {
> > > > +		while (!list_empty(&cache->pending[i])) {
> > > > +			node = list_first_entry(&cache->pending[i],
> > > > +						struct btrfs_backref_node,
> > > > +						list);
> > > > +			btrfs_backref_cleanup_node(cache, node);
> > > > +		}
> > > > +	}
> > > 
> > > Non blocking suggestion:
> > > 
> > > The fact that this cleanup has to keep an accurate view of the leaves
> > > while it runs feels like overkill. If we just maintained a linked list
> > > of all the nodes/edges we could call kfree on all of them. I think the
> > > existing rbtree of nodes would probably just work too?
> > > 
> > > I think just adding the pending instead of assuming it's empty makes
> > > sense, though.
> > 
> > This is one of the cleanups that's coming, because we keep a list of everything
> > in either ->leaves or ->detached, and now we can have them in the pending list.
> > The cleanup function assumes that everything that is handled at this point so
> > everything should be empty.
> > 
> > This is obviously silly, so the cleanup is to iterate the rbtree and free that
> > way and not have all this stuff.
> > 
> > We were only calling this after we were completely done with the loop for
> > backref cache, so the pending lists should have been empty if there was an error
> > and cleaned up before calling this function.
> > 
> > But now because we're calling it every time we have to make sure we evict the
> > pending list ourselves.
> > 
> > Again, this was done for simplicity sake, the problem is bad enough we want it
> > to be backported as widely as possible, so all other cleanup stuff is going to
> > be done separately.
> > 
> > > 
> > > >  	ASSERT(list_empty(&cache->pending_edge));
> > > >  	ASSERT(list_empty(&cache->useless_node));
> > > >  	ASSERT(list_empty(&cache->changed));
> > > > diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> > > > index ea4ed85919ec..aaa9cac213f1 100644
> > > > --- a/fs/btrfs/relocation.c
> > > > +++ b/fs/btrfs/relocation.c
> > > > @@ -232,70 +232,6 @@ static struct btrfs_backref_node *walk_down_backref(
> > > >  	return NULL;
> > > >  }
> > > >  
> > > > -static void update_backref_node(struct btrfs_backref_cache *cache,
> > > > -				struct btrfs_backref_node *node, u64 bytenr)
> > > > -{
> > > > -	struct rb_node *rb_node;
> > > > -	rb_erase(&node->rb_node, &cache->rb_root);
> > > > -	node->bytenr = bytenr;
> > > > -	rb_node = rb_simple_insert(&cache->rb_root, node->bytenr, &node->rb_node);
> > > > -	if (rb_node)
> > > > -		btrfs_backref_panic(cache->fs_info, bytenr, -EEXIST);
> > > > -}
> > > > -
> > > > -/*
> > > > - * update backref cache after a transaction commit
> > > > - */
> > > > -static int update_backref_cache(struct btrfs_trans_handle *trans,
> > > > -				struct btrfs_backref_cache *cache)
> > > > -{
> > > > -	struct btrfs_backref_node *node;
> > > > -	int level = 0;
> > > > -
> > > > -	if (cache->last_trans == 0) {
> > > > -		cache->last_trans = trans->transid;
> > > > -		return 0;
> > > > -	}
> > > > -
> > > > -	if (cache->last_trans == trans->transid)
> > > > -		return 0;
> > > > -
> > > > -	/*
> > > > -	 * detached nodes are used to avoid unnecessary backref
> > > > -	 * lookup. transaction commit changes the extent tree.
> > > > -	 * so the detached nodes are no longer useful.
> > > > -	 */
> > > > -	while (!list_empty(&cache->detached)) {
> > > > -		node = list_entry(cache->detached.next,
> > > > -				  struct btrfs_backref_node, list);
> > > > -		btrfs_backref_cleanup_node(cache, node);
> > > > -	}
> > > > -
> > > > -	while (!list_empty(&cache->changed)) {
> > > > -		node = list_entry(cache->changed.next,
> > > > -				  struct btrfs_backref_node, list);
> > > > -		list_del_init(&node->list);
> > > > -		BUG_ON(node->pending);
> > > > -		update_backref_node(cache, node, node->new_bytenr);
> > > > -	}
> > > > -
> > > > -	/*
> > > > -	 * some nodes can be left in the pending list if there were
> > > > -	 * errors during processing the pending nodes.
> > > > -	 */
> > > > -	for (level = 0; level < BTRFS_MAX_LEVEL; level++) {
> > > > -		list_for_each_entry(node, &cache->pending[level], list) {
> > > > -			BUG_ON(!node->pending);
> > > > -			if (node->bytenr == node->new_bytenr)
> > > > -				continue;
> > > > -			update_backref_node(cache, node, node->new_bytenr);
> > > > -		}
> > > > -	}
> > > > -
> > > > -	cache->last_trans = 0;
> > > > -	return 1;
> > > > -}
> > > > -
> > > >  static bool reloc_root_is_dead(const struct btrfs_root *root)
> > > >  {
> > > >  	/*
> > > > @@ -551,9 +487,6 @@ static int clone_backref_node(struct btrfs_trans_handle *trans,
> > > >  	struct btrfs_backref_edge *new_edge;
> > > >  	struct rb_node *rb_node;
> > > >  
> > > > -	if (cache->last_trans > 0)
> > > > -		update_backref_cache(trans, cache);
> > > > -
> > > 
> > > This looks suspicious to me. You said in the commit message that we need
> > > to nuke the cache any time we cross a transid. However, here, you detect
> > > a changed transid s.t. we would be calling update (presumably for some
> > > reason someone felt was important) but now we are just skipping it.
> > > 
> > > Why is this correct/safe? Do we need to dump the cache here too? Are we
> > > never going to hit this because of some implicit synchronization between
> > > this post snapshot path and the cache-blowing-up you added to
> > > relocate_block_group?
> > > 
> > > And if we need to dump the cache here to be safe, are you sure the other
> > > places we can go into the relocation code from outside
> > > relocate_block_group are safe to interact with an old cache too? I'm
> > > thinking of btrfs_reloc_cow_block primarily.
> > 
> > This is an extremely subtle part.  clone_backref_node() is called when we create
> > the reloc root, which can happen out of band of the relocation, so outside of
> > the relocate_block_group loop.
> > 
> > This was made to work in conjunction with the updating logic, so we had to call
> > it here in order to make sure the code in clone found the right nodes, because
> > src->commit_root->start would have been in ->new_bytenr if we hadn't updated the
> > backref cache yet.
> > 
> > Now we don't care, if it happens out of band we actually don't want to keep
> > those cloned backref cache, so the relocate block group loop will evict the
> > cache and rebuild it anyway.
> > 
> > If this happens where it should, eg select_reloc_root, then the cache will be
> > uptodate and src->commit_root->start will match what exists in the backref
> > cache.
> > 
> > This is safe specifically because in the relocate block group loop we will evict
> > everything, and then any reloc root creation that happens once we have the
> > bacrekf cache in place will be synchronized appropriately.
> 
> I thought about it some more, and noticed that the clone_backref_node
> can only be called in the transaction critical section in
> create_pending_snapshot, while the btrfs_relocate_block_group can only
> run under a transaction, so if clone_backref_node runs, relocation is
> guaranteed to see a changed transid by the time it runs. Since the
> effects of clone_backref_node are contained entirely to the backref
> cache, which we are guaranteed is gonna be blown away next relocation
> step, I think we can just omit clone_backref_node entirely?

Yeah, it's getting deleted in followup patches.

> 
> Whether we want to get rid of it entirely or not, I am relatively
> convinced now that this aspect of the patch is fine.
> 
> On the other hand, I am now extremely concerned about the interaction
> between the new relocate_block_group and btrfs_reloc_cow_block...
> > 
> > > 
> > > >  	rb_node = rb_simple_search(&cache->rb_root, src->commit_root->start);
> > > >  	if (rb_node) {
> > > >  		node = rb_entry(rb_node, struct btrfs_backref_node, rb_node);
> > > > @@ -3698,10 +3631,11 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
> > > >  			break;
> > > >  		}
> > > >  restart:
> > > > -		if (update_backref_cache(trans, &rc->backref_cache)) {
> > > > -			btrfs_end_transaction(trans);
> > > > -			trans = NULL;
> > > > -			continue;
> > > > +		if (rc->backref_cache.last_trans == 0) {
> > > > +			rc->backref_cache.last_trans = trans->transid;
> > > > +		} else if (rc->backref_cache.last_trans != trans->transid) {
> > > > +			btrfs_backref_release_cache(&rc->backref_cache);
> 
> What prevents a race between this and btrfs_reloc_cow_block running
> node = rc->backref_cache.path[level];
> 
> I feel like I must be missing some higher level synchronization between
> the two, but if my suspicion is right, btrfs_reloc_cow_block would now
> grab a backref_cache node and use it while btrfs_backref_release_cache
> called kfree on it, probably resulting in UAF.

This is the bit you're missing

if (btrfs_root_id(root) == BTRFS_TREE_RELOC_OBJECTID && rc->create_reloc_tree) {

We only do this if we're the reloc root, and we only search down the reloc root
from relocation, so we know if we got here and we're the reloc root that
everything is setup properly.

> 
> > > > +			rc->backref_cache.last_trans = trans->transid;
> > > 
> > > Non blocking suggestion:
> > > This feels like it could be simpler if we initialized last_trans to 0
> > > after allocated the reloc_control, then we could just check if we need
> > > to release, then unconditionally update.
> > 
> > Do you mean this?
> > 
> > if (rc->backref_cache.last_trans != trans->transid) {
> > 	if (rc->backref_cache.last_trans)
> > 		btrfs_backref_release_cache(&rc->backref_cache);
> > 	rc->backref_cache.last_trans = trans->transid;
> > }
> 
> Yeah, or even
> 
> if (rc->backref_cache.last_trans != trans->transid)
>         btrfs_backref_release_cache(&rc->backref_cache);
> // no-op if they are equal
> rc->backref_cache.last_trans = trans->transid;

Ah yeah duh, thanks,

Josef