On Tue, Sep 24, 2024 at 05:11:37PM -0400, Josef Bacik wrote:
> Since the inception of relocation we have maintained the backref cache
> across transaction commits, updating the backref cache with the new
> bytenr whenever we COW'ed blocks that were in the cache, and then
> updating their bytenr once we detected a transaction id change.
>
> This works as long as we're only ever modifying blocks, not changing
> the structure of the tree.
>
> However relocation does in fact change the structure of the tree. For
> example, if we are relocating a data extent, we will look up all the
> leaves that point to this data extent. We will then call
> do_relocation() on each of these leaves, which will COW down to the
> leaf and then update the file extent location.
>
> But a key feature of do_relocation() is the pending list. This holds
> all the pending nodes that we modified when we updated the file
> extent item. We will then process all of these blocks via
> finish_pending_nodes(), which calls do_relocation() on all of the
> nodes that led up to that leaf.
>
> The purpose of this is to make sure we don't break sharing unless we
> absolutely have to. Consider the case that we have 3 snapshots that
> all point to this leaf through the same nodes; the initial COW would
> have created a whole new path. If we did this for all 3 snapshots we
> would end up with 3x the number of nodes we had originally. To avoid
> this we will cycle through each of the snapshots that point to each
> of these nodes and update their pointers to point at the new nodes.
>
> Once we update the pointer to the new node we will drop the node we
> removed the link for, and all of its children, via
> btrfs_drop_subtree(). This is essentially just btrfs_drop_snapshot(),
> but for an arbitrary point in the snapshot.
>
> The problem is that we never reflect this in the backref cache. If we
> do this btrfs_drop_snapshot() for a node that is in the backref tree,
> we will leave the node in the backref tree. This becomes a problem
> when we change the transid, as now the backref cache has entire
> subtrees that no longer exist, but which the cache still shows as
> pointed to by the same roots.
>
> In the best case scenario you end up with "adding refs to an existing
> tree ref" errors from insert_inline_extent_backref(), where we
> attempt to link in nodes on roots that are no longer valid.
>
> Worst case, you will double free some random block and re-use it
> while there are still references to it.
>
> This is extremely subtle, and the consequences are quite bad. There
> isn't a way to make sure our backref cache is consistent between
> transids.
>
> In order to fix this we need to simply evict the entire backref cache
> any time we cross transids. This reduces performance in that we have
> to rebuild the backref cache every time we change transids, but it
> fixes the bug.
>
> This has existed since relocation was added, and is a pretty critical
> bug. There's a lot more cleanup that can be done now that this
> functionality is going away, but this patch is as small as possible
> in order to fix the problem and make it easy for us to backport it to
> all the kernels it needs to be backported to.
>
> Followup series will dismantle more of this code and simplify
> relocation drastically to remove this functionality.

+1

> We have a reproducer that reproduces the corruption within a few
> minutes of running. With this patch it survives several
> iterations/hours of running the reproducer.
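To make sure I'm reading the new lifetime rule correctly: the cache
must never survive a transid boundary, so the consumer that notices a
new transid tears the whole thing down and lets it repopulate lazily.
Condensed, that is roughly (my summary of the patch below, not the
literal code):

	/* My reading of the new rule; not the literal patch. */
	if (cache->last_trans && cache->last_trans != trans->transid)
		btrfs_backref_release_cache(cache);
	cache->last_trans = trans->transid;

That trade (a full rebuild per transaction vs. no cross-transid repair
at all) seems clearly right given how subtle the repair turned out to
be.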
>
> Fixes: 3fd0a5585eb9 ("Btrfs: Metadata ENOSPC handling for balance")
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
> ---
>  fs/btrfs/backref.c    | 12 ++++---
>  fs/btrfs/relocation.c | 76 +++----------------------------------------
>  2 files changed, 13 insertions(+), 75 deletions(-)
>
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index e2f478ecd7fd..f8e1d5b2c512 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -3179,10 +3179,14 @@ void btrfs_backref_release_cache(struct btrfs_backref_cache *cache)
>  		btrfs_backref_cleanup_node(cache, node);
>  	}
>  
> -	cache->last_trans = 0;
> -
> -	for (i = 0; i < BTRFS_MAX_LEVEL; i++)
> -		ASSERT(list_empty(&cache->pending[i]));
> +	for (i = 0; i < BTRFS_MAX_LEVEL; i++) {
> +		while (!list_empty(&cache->pending[i])) {
> +			node = list_first_entry(&cache->pending[i],
> +						struct btrfs_backref_node,
> +						list);
> +			btrfs_backref_cleanup_node(cache, node);
> +		}
> +	}

Non-blocking suggestion: the fact that this cleanup has to keep the
cache's bookkeeping consistent while it tears everything down feels
like overkill. If we just maintained a linked list of all the
nodes/edges, we could call kfree() on all of them. I think walking the
existing rbtree of nodes would probably work too? Either way, draining
the pending lists here instead of asserting they're empty makes sense
to me.
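Concretely, something like the following is what I'm picturing
(untested sketch; it leans on btrfs_backref_cleanup_node() unlinking
the node from rb_root via btrfs_backref_drop_node(), and assumes every
remaining node is still reachable from the rbtree, which is the part
I'd want to double check):

	struct btrfs_backref_node *node;
	struct rb_node *rb;

	/* Pop nodes off the rbtree until nothing is left. */
	while ((rb = rb_first(&cache->rb_root))) {
		node = rb_entry(rb, struct btrfs_backref_node, rb_node);
		btrfs_backref_cleanup_node(cache, node);
	}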
>  	ASSERT(list_empty(&cache->pending_edge));
>  	ASSERT(list_empty(&cache->useless_node));
>  	ASSERT(list_empty(&cache->changed));
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index ea4ed85919ec..aaa9cac213f1 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -232,70 +232,6 @@ static struct btrfs_backref_node *walk_down_backref(
>  	return NULL;
>  }
>  
> -static void update_backref_node(struct btrfs_backref_cache *cache,
> -				struct btrfs_backref_node *node, u64 bytenr)
> -{
> -	struct rb_node *rb_node;
> -
> -	rb_erase(&node->rb_node, &cache->rb_root);
> -	node->bytenr = bytenr;
> -	rb_node = rb_simple_insert(&cache->rb_root, node->bytenr, &node->rb_node);
> -	if (rb_node)
> -		btrfs_backref_panic(cache->fs_info, bytenr, -EEXIST);
> -}
> -
> -/*
> - * update backref cache after a transaction commit
> - */
> -static int update_backref_cache(struct btrfs_trans_handle *trans,
> -				struct btrfs_backref_cache *cache)
> -{
> -	struct btrfs_backref_node *node;
> -	int level = 0;
> -
> -	if (cache->last_trans == 0) {
> -		cache->last_trans = trans->transid;
> -		return 0;
> -	}
> -
> -	if (cache->last_trans == trans->transid)
> -		return 0;
> -
> -	/*
> -	 * detached nodes are used to avoid unnecessary backref
> -	 * lookup. transaction commit changes the extent tree.
> -	 * so the detached nodes are no longer useful.
> -	 */
> -	while (!list_empty(&cache->detached)) {
> -		node = list_entry(cache->detached.next,
> -				  struct btrfs_backref_node, list);
> -		btrfs_backref_cleanup_node(cache, node);
> -	}
> -
> -	while (!list_empty(&cache->changed)) {
> -		node = list_entry(cache->changed.next,
> -				  struct btrfs_backref_node, list);
> -		list_del_init(&node->list);
> -		BUG_ON(node->pending);
> -		update_backref_node(cache, node, node->new_bytenr);
> -	}
> -
> -	/*
> -	 * some nodes can be left in the pending list if there were
> -	 * errors during processing the pending nodes.
> -	 */
> -	for (level = 0; level < BTRFS_MAX_LEVEL; level++) {
> -		list_for_each_entry(node, &cache->pending[level], list) {
> -			BUG_ON(!node->pending);
> -			if (node->bytenr == node->new_bytenr)
> -				continue;
> -			update_backref_node(cache, node, node->new_bytenr);
> -		}
> -	}
> -
> -	cache->last_trans = 0;
> -	return 1;
> -}
> -
>  static bool reloc_root_is_dead(const struct btrfs_root *root)
>  {
>  	/*
> @@ -551,9 +487,6 @@ static int clone_backref_node(struct btrfs_trans_handle *trans,
>  	struct btrfs_backref_edge *new_edge;
>  	struct rb_node *rb_node;
>  
> -	if (cache->last_trans > 0)
> -		update_backref_cache(trans, cache);
> -

This looks suspicious to me. You said in the commit message that we
need to nuke the cache any time we cross a transid. Here, however, the
removed code detected exactly the situation where we would previously
have called update_backref_cache() (presumably for a reason someone
felt was important), and now we just skip it. Why is this
correct/safe? Do we need to dump the cache here too? Or are we
guaranteed never to hit this with a stale cache because of some
implicit synchronization between this post-snapshot path and the cache
eviction you added to relocate_block_group()?

And if we do need to dump the cache here to be safe, are you sure the
other places where we enter the relocation code from outside
relocate_block_group() are safe to interact with an old cache too? I'm
thinking of btrfs_reloc_cow_block() primarily.

>  	rb_node = rb_simple_search(&cache->rb_root, src->commit_root->start);
>  	if (rb_node) {
>  		node = rb_entry(rb_node, struct btrfs_backref_node, rb_node);
> @@ -3698,10 +3631,11 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
>  			break;
>  		}
>  restart:
> -		if (update_backref_cache(trans, &rc->backref_cache)) {
> -			btrfs_end_transaction(trans);
> -			trans = NULL;
> -			continue;
> +		if (rc->backref_cache.last_trans == 0) {
> +			rc->backref_cache.last_trans = trans->transid;
> +		} else if (rc->backref_cache.last_trans != trans->transid) {
> +			btrfs_backref_release_cache(&rc->backref_cache);
> +			rc->backref_cache.last_trans = trans->transid;

Non-blocking suggestion: this feels like it could be simpler. If we
relied on last_trans being 0 right after the reloc_control is
allocated, we could just check whether we need to release, then
unconditionally update last_trans (rough sketch at the end of this
mail).

>  		}
>  
>  		ret = find_next_extent(rc, path, &key);
> -- 
> 2.43.0
> 
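To spell out that last suggestion, this is roughly what I had in mind
(untested sketch; it relies on the reloc_control being zero-initialized
at allocation so that last_trans starts at 0, and on
btrfs_backref_release_cache() being a harmless no-op for an empty
cache):

	/*
	 * Sketch only: on the first pass last_trans is 0, so we
	 * "release" an empty cache, which should be a no-op, and then
	 * stamp the current transid unconditionally.
	 */
	if (rc->backref_cache.last_trans != trans->transid)
		btrfs_backref_release_cache(&rc->backref_cache);
	rc->backref_cache.last_trans = trans->transid;

That drops the three-way branch and leaves a single invariant: past
this point the cache always belongs to the current transaction.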