Re: [PATCH] btrfs: drop the backref cache during relocation if we commit

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Sep 25, 2024 at 04:17:34PM -0400, Josef Bacik wrote:
> On Wed, Sep 25, 2024 at 12:07:43PM -0700, Boris Burkov wrote:
> > On Tue, Sep 24, 2024 at 05:11:37PM -0400, Josef Bacik wrote:
> > > Since the inception of relocation we have maintained the backref cache
> > > across transaction commits, updating the backref cache with the new
> > > bytenr whenever we COW'ed blocks that were in the cache, and then
> > > updating their bytenr once we detected a transaction id change.
> > > 
> > > This works as long as we're only ever modifying blocks, not changing the
> > > structure of the tree.
> > > 
> > > However relocation does in fact change the structure of the tree.  For
> > > example, if we are relocating a data extent, we will look up all the
> > > leaves that point to this data extent.  We will then call
> > > do_relocation() on each of these leaves, which will COW down to the leaf
> > > and then update the file extent location.
> > > 
> > > But, a key feature of do_relocation is the pending list.  This is all
> > > the pending nodes that we modified when we updated the file extent item.
> > > We will then process all of these blocks via finish_pending_nodes, which
> > > calls do_relocation() on all of the nodes that led up to that leaf.
> > > 
> > > The purpose of this is to make sure we don't break sharing unless we
> > > absolutely have to.  Consider the case that we have 3 snapshots that all
> > > point to this leaf through the same nodes, the initial COW would have
> > > created a whole new path.  If we did this for all 3 snapshots we would
> > > end up with 3x the number of nodes we had originally.  To avoid this we
> > > will cycle through each of the snapshots that point to each of these
> > > nodes and update their pointers to point at the new nodes.
> > > 
> > > Once we update the pointer to the new node we will drop the node we
> > > removed the link for and all of its children via btrfs_drop_subtree().
> > > This is essentially just btrfs_drop_snapshot(), but for an arbitrary
> > > point in the snapshot.
> > > 
> > > The problem with this is that we will never reflect this in the backref
> > > cache.  If we do this btrfs_drop_snapshot() for a node that is in the
> > > backref tree, we will leave the node in the backref tree.  This becomes
> > > a problem when we change the transid, as now the backref cache has
> > > entire subtrees that no longer exist, but exist as if they still are
> > > pointed to by the same roots.
> > > 
> > > In the best case scenario you end up with "adding refs to an existing
> > > tree ref" errors from insert_inline_extent_backref(), where we attempt
> > > to link in nodes on roots that are no longer valid.
> > > 
> > > Worst case you will double free some random block and re-use it when
> > > there's still references to the block.
> > > 
> > > This is extremely subtle, and the consequences are quite bad.  There
> > > isn't a way to make sure our backref cache is consistent between
> > > transid's.
> > > 
> > > In order to fix this we need to simply evict the entire backref cache
> > > anytime we cross transid's.  This reduces performance in that we have to
> > > rebuild this backref cache every time we change transid's, but fixes the
> > > bug.
> > > 
> > > This has existed since relocation was added, and is a pretty critical
> > > bug.  There's a lot more cleanup that can be done now that this
> > > functionality is going away, but this patch is as small as possible in
> > > order to fix the problem and make it easy for us to backport it to all
> > > the kernels it needs to be backported to.
> > > 
> > > Followup series will dismantle more of this code and simplify relocation
> > > drastically to remove this functionality.
> > 
> > +1
> > 
> > > 
> > > We have a reproducer that reproduced the corruption within a few minutes
> > > of running.  With this patch it survives several iterations/hours of
> > > running the reproducer.
> > > 
> > > Fixes: 3fd0a5585eb9 ("Btrfs: Metadata ENOSPC handling for balance")
> > > Cc: stable@xxxxxxxxxxxxxxx
> > > Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
> > > ---
> > >  fs/btrfs/backref.c    | 12 ++++---
> > >  fs/btrfs/relocation.c | 76 +++----------------------------------------
> > >  2 files changed, 13 insertions(+), 75 deletions(-)
> > > 
> > > diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> > > index e2f478ecd7fd..f8e1d5b2c512 100644
> > > --- a/fs/btrfs/backref.c
> > > +++ b/fs/btrfs/backref.c
> > > @@ -3179,10 +3179,14 @@ void btrfs_backref_release_cache(struct btrfs_backref_cache *cache)
> > >  		btrfs_backref_cleanup_node(cache, node);
> > >  	}
> > >  
> > > -	cache->last_trans = 0;
> > > -
> > > -	for (i = 0; i < BTRFS_MAX_LEVEL; i++)
> > > -		ASSERT(list_empty(&cache->pending[i]));
> > > +	for (i = 0; i < BTRFS_MAX_LEVEL; i++) {
> > > +		while (!list_empty(&cache->pending[i])) {
> > > +			node = list_first_entry(&cache->pending[i],
> > > +						struct btrfs_backref_node,
> > > +						list);
> > > +			btrfs_backref_cleanup_node(cache, node);
> > > +		}
> > > +	}
> > 
> > Non blocking suggestion:
> > 
> > The fact that this cleanup has to keep an accurate view of the leaves
> > while it runs feels like overkill. If we just maintained a linked list
> > of all the nodes/edges we could call kfree on all of them. I think the
> > existing rbtree of nodes would probably just work too?
> > 
> > I think just adding the pending instead of assuming it's empty makes
> > sense, though.
> 
> This is one of the cleanups that's coming, because we keep a list of everything
> in either ->leaves or ->detached, and now we can have them in the pending list.
> The cleanup function assumes that everything that is handled at this point so
> everything should be empty.
> 
> This is obviously silly, so the cleanup is to iterate the rbtree and free that
> way and not have all this stuff.
> 
> We were only calling this after we were completely done with the loop for
> backref cache, so the pending lists should have been empty if there was an error
> and cleaned up before calling this function.
> 
> But now because we're calling it every time we have to make sure we evict the
> pending list ourselves.
> 
> Again, this was done for simplicity sake, the problem is bad enough we want it
> to be backported as widely as possible, so all other cleanup stuff is going to
> be done separately.
> 
> > 
> > >  	ASSERT(list_empty(&cache->pending_edge));
> > >  	ASSERT(list_empty(&cache->useless_node));
> > >  	ASSERT(list_empty(&cache->changed));
> > > diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> > > index ea4ed85919ec..aaa9cac213f1 100644
> > > --- a/fs/btrfs/relocation.c
> > > +++ b/fs/btrfs/relocation.c
> > > @@ -232,70 +232,6 @@ static struct btrfs_backref_node *walk_down_backref(
> > >  	return NULL;
> > >  }
> > >  
> > > -static void update_backref_node(struct btrfs_backref_cache *cache,
> > > -				struct btrfs_backref_node *node, u64 bytenr)
> > > -{
> > > -	struct rb_node *rb_node;
> > > -	rb_erase(&node->rb_node, &cache->rb_root);
> > > -	node->bytenr = bytenr;
> > > -	rb_node = rb_simple_insert(&cache->rb_root, node->bytenr, &node->rb_node);
> > > -	if (rb_node)
> > > -		btrfs_backref_panic(cache->fs_info, bytenr, -EEXIST);
> > > -}
> > > -
> > > -/*
> > > - * update backref cache after a transaction commit
> > > - */
> > > -static int update_backref_cache(struct btrfs_trans_handle *trans,
> > > -				struct btrfs_backref_cache *cache)
> > > -{
> > > -	struct btrfs_backref_node *node;
> > > -	int level = 0;
> > > -
> > > -	if (cache->last_trans == 0) {
> > > -		cache->last_trans = trans->transid;
> > > -		return 0;
> > > -	}
> > > -
> > > -	if (cache->last_trans == trans->transid)
> > > -		return 0;
> > > -
> > > -	/*
> > > -	 * detached nodes are used to avoid unnecessary backref
> > > -	 * lookup. transaction commit changes the extent tree.
> > > -	 * so the detached nodes are no longer useful.
> > > -	 */
> > > -	while (!list_empty(&cache->detached)) {
> > > -		node = list_entry(cache->detached.next,
> > > -				  struct btrfs_backref_node, list);
> > > -		btrfs_backref_cleanup_node(cache, node);
> > > -	}
> > > -
> > > -	while (!list_empty(&cache->changed)) {
> > > -		node = list_entry(cache->changed.next,
> > > -				  struct btrfs_backref_node, list);
> > > -		list_del_init(&node->list);
> > > -		BUG_ON(node->pending);
> > > -		update_backref_node(cache, node, node->new_bytenr);
> > > -	}
> > > -
> > > -	/*
> > > -	 * some nodes can be left in the pending list if there were
> > > -	 * errors during processing the pending nodes.
> > > -	 */
> > > -	for (level = 0; level < BTRFS_MAX_LEVEL; level++) {
> > > -		list_for_each_entry(node, &cache->pending[level], list) {
> > > -			BUG_ON(!node->pending);
> > > -			if (node->bytenr == node->new_bytenr)
> > > -				continue;
> > > -			update_backref_node(cache, node, node->new_bytenr);
> > > -		}
> > > -	}
> > > -
> > > -	cache->last_trans = 0;
> > > -	return 1;
> > > -}
> > > -
> > >  static bool reloc_root_is_dead(const struct btrfs_root *root)
> > >  {
> > >  	/*
> > > @@ -551,9 +487,6 @@ static int clone_backref_node(struct btrfs_trans_handle *trans,
> > >  	struct btrfs_backref_edge *new_edge;
> > >  	struct rb_node *rb_node;
> > >  
> > > -	if (cache->last_trans > 0)
> > > -		update_backref_cache(trans, cache);
> > > -
> > 
> > This looks suspicious to me. You said in the commit message that we need
> > to nuke the cache any time we cross a transid. However, here, you detect
> > a changed transid s.t. we would be calling update (presumably for some
> > reason someone felt was important) but now we are just skipping it.
> > 
> > Why is this correct/safe? Do we need to dump the cache here too? Are we
> > never going to hit this because of some implicit synchronization between
> > this post snapshot path and the cache-blowing-up you added to
> > relocate_block_group?
> > 
> > And if we need to dump the cache here to be safe, are you sure the other
> > places we can go into the relocation code from outside
> > relocate_block_group are safe to interact with an old cache too? I'm
> > thinking of btrfs_reloc_cow_block primarily.
> 
> This is an extremely subtle part.  clone_backref_node() is called when we create
> the reloc root, which can happen out of band of the relocation, so outside of
> the relocate_block_group loop.
> 
> This was made to work in conjunction with the updating logic, so we had to call
> it here in order to make sure the code in clone found the right nodes, because
> src->commit_root->start would have been in ->new_bytenr if we hadn't updated the
> backref cache yet.
> 
> Now we don't care, if it happens out of band we actually don't want to keep
> those cloned backref cache, so the relocate block group loop will evict the
> cache and rebuild it anyway.
> 
> If this happens where it should, eg select_reloc_root, then the cache will be
> uptodate and src->commit_root->start will match what exists in the backref
> cache.
> 
> This is safe specifically because in the relocate block group loop we will evict
> everything, and then any reloc root creation that happens once we have the
> bacrekf cache in place will be synchronized appropriately.

I thought about it some more, and noticed that the clone_backref_node
can only be called in the transaction critical section in
create_pending_snapshot, while the btrfs_relocate_block_group can only
run under a transaction, so if clone_backref_node runs, relocation is
guaranteed to see a changed transid by the time it runs. Since the
effects of clone_backref_node are contained entirely to the backref
cache, which we are guaranteed is gonna be blown away next relocation
step, I think we can just omit clone_backref_node entirely?

Whether we want to get rid of it entirely or not, I am relatively
convinced now that this aspect of the patch is fine.

On the other hand, I am now extremely concerned about the interaction
between the new relocate_block_group and btrfs_reloc_cow_block...
> 
> > 
> > >  	rb_node = rb_simple_search(&cache->rb_root, src->commit_root->start);
> > >  	if (rb_node) {
> > >  		node = rb_entry(rb_node, struct btrfs_backref_node, rb_node);
> > > @@ -3698,10 +3631,11 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
> > >  			break;
> > >  		}
> > >  restart:
> > > -		if (update_backref_cache(trans, &rc->backref_cache)) {
> > > -			btrfs_end_transaction(trans);
> > > -			trans = NULL;
> > > -			continue;
> > > +		if (rc->backref_cache.last_trans == 0) {
> > > +			rc->backref_cache.last_trans = trans->transid;
> > > +		} else if (rc->backref_cache.last_trans != trans->transid) {
> > > +			btrfs_backref_release_cache(&rc->backref_cache);

What prevents a race between this and btrfs_reloc_cow_block running
node = rc->backref_cache.path[level];

I feel like I must be missing some higher level synchronization between
the two, but if my suspicion is right, btrfs_reloc_cow_block would now
grab a backref_cache node and use it while btrfs_backref_release_cache
called kfree on it, probably resulting in UAF.

> > > +			rc->backref_cache.last_trans = trans->transid;
> > 
> > Non blocking suggestion:
> > This feels like it could be simpler if we initialized last_trans to 0
> > after allocated the reloc_control, then we could just check if we need
> > to release, then unconditionally update.
> 
> Do you mean this?
> 
> if (rc->backref_cache.last_trans != trans->transid) {
> 	if (rc->backref_cache.last_trans)
> 		btrfs_backref_release_cache(&rc->backref_cache);
> 	rc->backref_cache.last_trans = trans->transid;
> }

Yeah, or even

if (rc->backref_cache.last_trans != trans->transid)
        btrfs_backref_release_cache(&rc->backref_cache);
// no-op if they are equal
rc->backref_cache.last_trans = trans->transid;

> 
> Thanks,
> 
> Josef




[Index of Archives]     [Linux Kernel]     [Kernel Development Newbies]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite Hiking]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux