On Wed, Sep 25, 2024 at 04:17:34PM -0400, Josef Bacik wrote:
> On Wed, Sep 25, 2024 at 12:07:43PM -0700, Boris Burkov wrote:
> > On Tue, Sep 24, 2024 at 05:11:37PM -0400, Josef Bacik wrote:
> > > Since the inception of relocation we have maintained the backref
> > > cache across transaction commits, updating the backref cache with
> > > the new bytenr whenever we COW'ed blocks that were in the cache,
> > > and then updating their bytenr once we detected a transaction id
> > > change.
> > >
> > > This works as long as we're only ever modifying blocks, not
> > > changing the structure of the tree.
> > >
> > > However relocation does in fact change the structure of the tree.
> > > For example, if we are relocating a data extent, we will look up
> > > all the leaves that point to this data extent. We will then call
> > > do_relocation() on each of these leaves, which will COW down to
> > > the leaf and then update the file extent location.
> > >
> > > But a key feature of do_relocation() is the pending list. This is
> > > all the pending nodes that we modified when we updated the file
> > > extent item. We will then process all of these blocks via
> > > finish_pending_nodes(), which calls do_relocation() on all of the
> > > nodes that led up to that leaf.
> > >
> > > The purpose of this is to make sure we don't break sharing unless
> > > we absolutely have to. Consider the case that we have 3 snapshots
> > > that all point to this leaf through the same nodes; the initial
> > > COW would have created a whole new path. If we did this for all 3
> > > snapshots we would end up with 3x the number of nodes we had
> > > originally. To avoid this we will cycle through each of the
> > > snapshots that point to each of these nodes and update their
> > > pointers to point at the new nodes.
> > >
> > > Once we update the pointer to the new node we will drop the node
> > > we removed the link for, and all of its children, via
> > > btrfs_drop_subtree(). This is essentially just
> > > btrfs_drop_snapshot(), but for an arbitrary point in the snapshot.
> > >
> > > The problem with this is that we will never reflect this in the
> > > backref cache. If we do this btrfs_drop_snapshot() for a node that
> > > is in the backref tree, we will leave the node in the backref
> > > tree. This becomes a problem when we change the transid, as now
> > > the backref cache has entire subtrees that no longer exist, but
> > > that look as if they are still pointed to by the same roots.
> > >
> > > In the best case scenario you end up with "adding refs to an
> > > existing tree ref" errors from insert_inline_extent_backref(),
> > > where we attempt to link in nodes on roots that are no longer
> > > valid.
> > >
> > > Worst case you will double free some random block and re-use it
> > > while there are still references to the block.
> > >
> > > This is extremely subtle, and the consequences are quite bad.
> > > There isn't a way to make sure our backref cache is consistent
> > > between transids.
> > >
> > > In order to fix this we need to simply evict the entire backref
> > > cache any time we cross transids. This reduces performance in that
> > > we have to rebuild the backref cache every time we change
> > > transids, but fixes the bug.
> > >
> > > This has existed since relocation was added, and is a pretty
> > > critical bug.
> > > There's a lot more cleanup that can be done now that this
> > > functionality is going away, but this patch is as small as
> > > possible in order to fix the problem and make it easy for us to
> > > backport it to all the kernels it needs to be backported to.
> > >
> > > A followup series will dismantle more of this code and simplify
> > > relocation drastically to remove this functionality.
> >
> > +1
> >
> > > We have a reproducer that reproduced the corruption within a few
> > > minutes of running. With this patch it survives several
> > > iterations/hours of running the reproducer.
> > >
> > > Fixes: 3fd0a5585eb9 ("Btrfs: Metadata ENOSPC handling for balance")
> > > Cc: stable@xxxxxxxxxxxxxxx
> > > Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
> > > ---
> > >  fs/btrfs/backref.c    | 12 ++++---
> > >  fs/btrfs/relocation.c | 76 +++---------------------------------------
> > >  2 files changed, 13 insertions(+), 75 deletions(-)
> > >
> > > diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> > > index e2f478ecd7fd..f8e1d5b2c512 100644
> > > --- a/fs/btrfs/backref.c
> > > +++ b/fs/btrfs/backref.c
> > > @@ -3179,10 +3179,14 @@ void btrfs_backref_release_cache(struct btrfs_backref_cache *cache)
> > >  		btrfs_backref_cleanup_node(cache, node);
> > >  	}
> > >
> > > -	cache->last_trans = 0;
> > > -
> > > -	for (i = 0; i < BTRFS_MAX_LEVEL; i++)
> > > -		ASSERT(list_empty(&cache->pending[i]));
> > > +	for (i = 0; i < BTRFS_MAX_LEVEL; i++) {
> > > +		while (!list_empty(&cache->pending[i])) {
> > > +			node = list_first_entry(&cache->pending[i],
> > > +						struct btrfs_backref_node,
> > > +						list);
> > > +			btrfs_backref_cleanup_node(cache, node);
> > > +		}
> > > +	}
> >
> > Non blocking suggestion:
> >
> > The fact that this cleanup has to keep an accurate view of the
> > leaves while it runs feels like overkill. If we just maintained a
> > linked list of all the nodes/edges we could call kfree on all of
> > them. I think the existing rbtree of nodes would probably just work
> > too?
> >
> > I think just adding the pending list instead of assuming it's empty
> > makes sense, though.
>
> This is one of the cleanups that's coming, because we keep a list of
> everything in either ->leaves or ->detached, and now we can have them
> in the pending list. The cleanup function assumes that everything has
> been handled by this point, so everything should be empty.
>
> This is obviously silly, so the cleanup is to iterate the rbtree and
> free that way, and not have all this stuff.
>
> We were only calling this after we were completely done with the loop
> for the backref cache, so the pending lists should have been empty,
> or cleaned up before calling this function if there was an error.
>
> But now, because we're calling it every time, we have to make sure we
> evict the pending list ourselves.
>
> Again, this was done for simplicity's sake; the problem is bad enough
> that we want it to be backported as widely as possible, so all the
> other cleanup is going to be done separately.
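Makes sense. For the record, something like the below is the shape I
had in mind for the rbtree-walk version (just a sketch, not tested; it
assumes btrfs_backref_cleanup_node() keeps erasing the node from
cache->rb_root via btrfs_backref_drop_node() like it does today, and
the helper name is made up):

	/*
	 * Tear down every cached node by walking the rbtree that
	 * already tracks all of them, instead of draining ->leaves,
	 * ->detached, and the per-level ->pending lists one by one.
	 */
	static void backref_cache_drop_all(struct btrfs_backref_cache *cache)
	{
		struct rb_node *rb_node;

		while ((rb_node = rb_first(&cache->rb_root))) {
			struct btrfs_backref_node *node;

			node = rb_entry(rb_node, struct btrfs_backref_node,
					rb_node);
			/* Frees the node's edges and the node itself. */
			btrfs_backref_cleanup_node(cache, node);
		}
	}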
> > >
> > >  	ASSERT(list_empty(&cache->pending_edge));
> > >  	ASSERT(list_empty(&cache->useless_node));
> > >  	ASSERT(list_empty(&cache->changed));
> > > diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> > > index ea4ed85919ec..aaa9cac213f1 100644
> > > --- a/fs/btrfs/relocation.c
> > > +++ b/fs/btrfs/relocation.c
> > > @@ -232,70 +232,6 @@ static struct btrfs_backref_node *walk_down_backref(
> > >  	return NULL;
> > >  }
> > >
> > > -static void update_backref_node(struct btrfs_backref_cache *cache,
> > > -				struct btrfs_backref_node *node, u64 bytenr)
> > > -{
> > > -	struct rb_node *rb_node;
> > > -
> > > -	rb_erase(&node->rb_node, &cache->rb_root);
> > > -	node->bytenr = bytenr;
> > > -	rb_node = rb_simple_insert(&cache->rb_root, node->bytenr, &node->rb_node);
> > > -	if (rb_node)
> > > -		btrfs_backref_panic(cache->fs_info, bytenr, -EEXIST);
> > > -}
> > > -
> > > -/*
> > > - * update backref cache after a transaction commit
> > > - */
> > > -static int update_backref_cache(struct btrfs_trans_handle *trans,
> > > -				struct btrfs_backref_cache *cache)
> > > -{
> > > -	struct btrfs_backref_node *node;
> > > -	int level = 0;
> > > -
> > > -	if (cache->last_trans == 0) {
> > > -		cache->last_trans = trans->transid;
> > > -		return 0;
> > > -	}
> > > -
> > > -	if (cache->last_trans == trans->transid)
> > > -		return 0;
> > > -
> > > -	/*
> > > -	 * detached nodes are used to avoid unnecessary backref
> > > -	 * lookup. transaction commit changes the extent tree.
> > > -	 * so the detached nodes are no longer useful.
> > > -	 */
> > > -	while (!list_empty(&cache->detached)) {
> > > -		node = list_entry(cache->detached.next,
> > > -				  struct btrfs_backref_node, list);
> > > -		btrfs_backref_cleanup_node(cache, node);
> > > -	}
> > > -
> > > -	while (!list_empty(&cache->changed)) {
> > > -		node = list_entry(cache->changed.next,
> > > -				  struct btrfs_backref_node, list);
> > > -		list_del_init(&node->list);
> > > -		BUG_ON(node->pending);
> > > -		update_backref_node(cache, node, node->new_bytenr);
> > > -	}
> > > -
> > > -	/*
> > > -	 * some nodes can be left in the pending list if there were
> > > -	 * errors during processing the pending nodes.
> > > -	 */
> > > -	for (level = 0; level < BTRFS_MAX_LEVEL; level++) {
> > > -		list_for_each_entry(node, &cache->pending[level], list) {
> > > -			BUG_ON(!node->pending);
> > > -			if (node->bytenr == node->new_bytenr)
> > > -				continue;
> > > -			update_backref_node(cache, node, node->new_bytenr);
> > > -		}
> > > -	}
> > > -
> > > -	cache->last_trans = 0;
> > > -	return 1;
> > > -}
> > > -
> > >  static bool reloc_root_is_dead(const struct btrfs_root *root)
> > >  {
> > >  	/*
> > > @@ -551,9 +487,6 @@ static int clone_backref_node(struct btrfs_trans_handle *trans,
> > >  	struct btrfs_backref_edge *new_edge;
> > >  	struct rb_node *rb_node;
> > >
> > > -	if (cache->last_trans > 0)
> > > -		update_backref_cache(trans, cache);
> > > -
> >
> > This looks suspicious to me. You said in the commit message that we
> > need to nuke the cache any time we cross a transid. However, here,
> > you detect a changed transid s.t. we would be calling update
> > (presumably for some reason someone felt was important), but now we
> > are just skipping it.
> >
> > Why is this correct/safe? Do we need to dump the cache here too?
> > Are we never going to hit this because of some implicit
> > synchronization between this post-snapshot path and the
> > cache-blowing-up you added to relocate_block_group?
> >
> > And if we need to dump the cache here to be safe, are you sure the
> > other places we can go into the relocation code from outside
> > relocate_block_group are safe to interact with an old cache too?
> > I'm thinking of btrfs_reloc_cow_block primarily.
>
> This is an extremely subtle part. clone_backref_node() is called when
> we create the reloc root, which can happen out of band of the
> relocation, so outside of the relocate_block_group loop.
>
> This was made to work in conjunction with the updating logic, so we
> had to call it here in order to make sure the code in clone found the
> right nodes, because src->commit_root->start would have been in
> ->new_bytenr if we hadn't updated the backref cache yet.
>
> Now we don't care; if it happens out of band we actually don't want
> to keep those cloned backref cache nodes, so the relocate_block_group
> loop will evict the cache and rebuild it anyway.
>
> If this happens where it should, e.g. select_reloc_root, then the
> cache will be up to date and src->commit_root->start will match what
> exists in the backref cache.
>
> This is safe specifically because in the relocate_block_group loop we
> will evict everything, and then any reloc root creation that happens
> once we have the backref cache in place will be synchronized
> appropriately.

I thought about it some more, and noticed that clone_backref_node()
can only be called in the transaction critical section in
create_pending_snapshot, while btrfs_relocate_block_group can only run
under a transaction, so if clone_backref_node() runs, relocation is
guaranteed to see a changed transid by the time it resumes. Since the
effects of clone_backref_node() are contained entirely to the backref
cache, which we are guaranteed is going to be blown away on the next
relocation step, I think we can just omit clone_backref_node()
entirely?

Whether we want to get rid of it entirely or not, I am relatively
convinced now that this aspect of the patch is fine.

On the other hand, I am now extremely concerned about the interaction
between the new relocate_block_group and btrfs_reloc_cow_block...

> > >
> > >  	rb_node = rb_simple_search(&cache->rb_root, src->commit_root->start);
> > >  	if (rb_node) {
> > >  		node = rb_entry(rb_node, struct btrfs_backref_node, rb_node);
> > > @@ -3698,10 +3631,11 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
> > >  			break;
> > >  		}
> > >  restart:
> > > -		if (update_backref_cache(trans, &rc->backref_cache)) {
> > > -			btrfs_end_transaction(trans);
> > > -			trans = NULL;
> > > -			continue;
> > > +		if (rc->backref_cache.last_trans == 0) {
> > > +			rc->backref_cache.last_trans = trans->transid;
> > > +		} else if (rc->backref_cache.last_trans != trans->transid) {
> > > +			btrfs_backref_release_cache(&rc->backref_cache);

What prevents a race between this and btrfs_reloc_cow_block running
"node = rc->backref_cache.path[level];"? I feel like I must be missing
some higher-level synchronization between the two, but if my suspicion
is right, btrfs_reloc_cow_block would now grab a backref_cache node
and use it while btrfs_backref_release_cache called kfree on it,
probably resulting in a UAF.
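To make the interleaving I'm worried about concrete (heavily
condensed; the btrfs_reloc_cow_block lines below are paraphrased from
the current code, not exact):

	/* relocate_block_group(), with this patch, on a transid change: */
	btrfs_backref_release_cache(&rc->backref_cache);
	/* ^ kfree()s every cached node, including whatever the
	 *   ->path[] slots still point at */

	/* btrfs_reloc_cow_block(), if it can run against the same rc: */
	node = rc->backref_cache.path[level];	/* now dangling? */
	drop_node_buffer(node);			/* use-after-free */
	node->eb = cow;
	node->new_bytenr = cow->start;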
> > > +		rc->backref_cache.last_trans = trans->transid;
> >
> > Non blocking suggestion:
> >
> > This feels like it could be simpler if we initialized last_trans to
> > 0 after allocating the reloc_control; then we could just check if
> > we need to release, then unconditionally update.
>
> Do you mean this?
>
> 	if (rc->backref_cache.last_trans != trans->transid) {
> 		if (rc->backref_cache.last_trans)
> 			btrfs_backref_release_cache(&rc->backref_cache);
> 		rc->backref_cache.last_trans = trans->transid;
> 	}

Yeah, or even

	if (rc->backref_cache.last_trans != trans->transid)
		btrfs_backref_release_cache(&rc->backref_cache);
	// no-op if they are equal
	rc->backref_cache.last_trans = trans->transid;

>
> Thanks,
>
> Josef