On Wed, Feb 13, 2019 at 11:33 AM Qu Wenruo <wqu@xxxxxxxx> wrote:
>
> From: Josef Bacik <josef@xxxxxxxxxxxxxx>
>
> Qgroups will do the old roots lookup at delayed ref time, which could be
> while walking down the extent root while running a delayed ref.  This
> should be fine, except we specifically lock eb's in the backref walking
> code irrespective of path->skip_locking, which deadlocks the system.
> Fix up the backref code to honor path->skip_locking; nobody will be
> modifying the commit_root while we're searching, so it's completely safe
> to skip the locks there.
>
> Since commit fb235dc06fac ("btrfs: qgroup: Move half of the qgroup
> accounting time out of commit trans"), the kernel may lock up with quota
> enabled.
>
> Below is one backref walk trace, triggered by snapshot dropping running
> together with write operations in the source subvolume.  The hang can be
> reliably reproduced:
>
>   btrfs-cleaner   D    0  4062      2 0x80000000
>   Call Trace:
>    schedule+0x32/0x90
>    btrfs_tree_read_lock+0x93/0x130 [btrfs]
>    find_parent_nodes+0x29b/0x1170 [btrfs]
>    btrfs_find_all_roots_safe+0xa8/0x120 [btrfs]
>    btrfs_find_all_roots+0x57/0x70 [btrfs]
>    btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
>    btrfs_qgroup_trace_leaf_items+0x10b/0x140 [btrfs]
>    btrfs_qgroup_trace_subtree+0xc8/0xe0 [btrfs]
>    do_walk_down+0x541/0x5e3 [btrfs]
>    walk_down_tree+0xab/0xe7 [btrfs]
>    btrfs_drop_snapshot+0x356/0x71a [btrfs]
>    btrfs_clean_one_deleted_snapshot+0xb8/0xf0 [btrfs]
>    cleaner_kthread+0x12b/0x160 [btrfs]
>    kthread+0x112/0x130
>    ret_from_fork+0x27/0x50
>
> When dropping snapshots with qgroup enabled, we trigger a backref walk.
>
> However, a backref walk at that timing is dangerous: if one of the
> parent nodes gets WRITE locked by another thread, we can deadlock.
>
> For example:
>
>            FS 260     FS 261 (Dropped)
>             node A        node B
>            /      \      /      \
>        node C      node D      node E
>       /   \         /  \        /     \
>    leaf F|leaf G|leaf H|leaf I|leaf J|leaf K
>
> The lock sequence would be:
>
>       Thread A (cleaner)             |       Thread B (other writer)
> -----------------------------------------------------------------------
> write_lock(B)                        |
> write_lock(D)                        |
> ^^^ called by walk_down_tree()       |
>                                      |       write_lock(A)
>                                      |       write_lock(D) << Stall
> read_lock(H) << for backref walk     |
> read_lock(D) << lock owner is        |
>                 the same thread A    |
>                 so read lock is OK   |
> read_lock(A) << Stall                |
>
> So thread A holds write lock D and needs read lock A to continue, while
> thread B holds write lock A and needs write lock D to continue.
>
> This will cause a deadlock.
>
> This is not limited to the snapshot dropping case.  The backref walk,
> even though it only happens on commit trees, breaks the normal top-down
> locking order, which makes it deadlock prone.
>
> Fixes: fb235dc06fac ("btrfs: qgroup: Move half of the qgroup accounting time out of commit trans")
> CC: stable@xxxxxxxxxxxxxxx # 4.19+
> Reported-and-tested-by: David Sterba <dsterba@xxxxxxxx>
> Reported-by: Filipe Manana <fdmanana@xxxxxxxx>
> Reviewed-by: Qu Wenruo <wqu@xxxxxxxx>
> Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
> [ copy logs and deadlock analysis from Qu's patch ]
> [ rebase to latest branch and fix lock assert bug ]
> Signed-off-by: David Sterba <dsterba@xxxxxxxx>

Reviewed-by: Filipe Manana <fdmanana@xxxxxxxx>

Looks good.

> ---
> changelog:
> v2:
> - Rebased to latest misc-next branch.
> - Fix a lock assert by moving btrfs_set_lock_blocking_read() into the
>   same branch as btrfs_tree_read_lock().
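As a side note on why skipping the locks is safe: these walks run against
the commit root, and callers already ask for that by setting
search_commit_root together with skip_locking on the path; the bug was
that the eb locking in the backref code ignored the latter.  Below is a
minimal sketch of that caller-side setup.  The function name is made up
for illustration and the error handling is simplified, so treat it as a
sketch of the pattern, not code from this patch:

	/*
	 * Sketch: a lockless search of the commit root.  With no running
	 * transaction the commit root is stable, so extent buffer locks
	 * are unnecessary and skip_locking avoids the ABBA inversion
	 * described above.
	 */
	static int example_commit_root_lookup(struct btrfs_root *root,
					      struct btrfs_key *key)
	{
		struct btrfs_path *path;
		int ret;

		path = btrfs_alloc_path();
		if (!path)
			return -ENOMEM;

		path->search_commit_root = 1;	/* walk the committed tree */
		path->skip_locking = 1;		/* honored by backref code after this patch */

		ret = btrfs_search_slot(NULL, root, key, path, 0, 0);

		btrfs_free_path(path);
		return ret;
	}

With skip_locking set, the hunks below leave the extent buffers unlocked
in add_missing_keys() and find_parent_nodes(), so the cleaner never has
to take read_lock(A) while holding write_lock(D).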
> ---
>  fs/btrfs/backref.c | 19 ++++++++++++-------
>  1 file changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index 136454dbb4af..6e647520b65a 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -712,7 +712,7 @@ static int resolve_indirect_refs(struct btrfs_fs_info *fs_info,
>   * read tree blocks and add keys where required.
>   */
>  static int add_missing_keys(struct btrfs_fs_info *fs_info,
> -			    struct preftrees *preftrees)
> +			    struct preftrees *preftrees, bool lock)
>  {
>  	struct prelim_ref *ref;
>  	struct extent_buffer *eb;
> @@ -737,12 +737,14 @@ static int add_missing_keys(struct btrfs_fs_info *fs_info,
>  			free_extent_buffer(eb);
>  			return -EIO;
>  		}
> -		btrfs_tree_read_lock(eb);
> +		if (lock)
> +			btrfs_tree_read_lock(eb);
>  		if (btrfs_header_level(eb) == 0)
>  			btrfs_item_key_to_cpu(eb, &ref->key_for_search, 0);
>  		else
>  			btrfs_node_key_to_cpu(eb, &ref->key_for_search, 0);
> -		btrfs_tree_read_unlock(eb);
> +		if (lock)
> +			btrfs_tree_read_unlock(eb);
>  		free_extent_buffer(eb);
>  		prelim_ref_insert(fs_info, &preftrees->indirect, ref, NULL);
>  		cond_resched();
> @@ -1227,7 +1229,7 @@ static int find_parent_nodes(struct btrfs_trans_handle *trans,
>
>  	btrfs_release_path(path);
>
> -	ret = add_missing_keys(fs_info, &preftrees);
> +	ret = add_missing_keys(fs_info, &preftrees, path->skip_locking == 0);
>  	if (ret)
>  		goto out;
>
> @@ -1288,11 +1290,14 @@ static int find_parent_nodes(struct btrfs_trans_handle *trans,
>  				ret = -EIO;
>  				goto out;
>  			}
> -			btrfs_tree_read_lock(eb);
> -			btrfs_set_lock_blocking_read(eb);
> +			if (!path->skip_locking) {
> +				btrfs_tree_read_lock(eb);
> +				btrfs_set_lock_blocking_read(eb);
> +			}
>  			ret = find_extent_in_eb(eb, bytenr,
>  						*extent_item_pos, &eie, ignore_offset);
> -			btrfs_tree_read_unlock_blocking(eb);
> +			if (!path->skip_locking)
> +				btrfs_tree_read_unlock_blocking(eb);
>  			free_extent_buffer(eb);
>  			if (ret < 0)
>  				goto out;
> --
> 2.18.0
>


--
Filipe David Manana,

"Whether you think you can, or you think you can't — you're right."