On Mon, Sep 30, 2019 at 11:25 PM Josef Bacik <josef@xxxxxxxxxxxxxx> wrote: > > We've historically had reports of being unable to mount file systems > because the tree log root couldn't be read. Usually this is the "parent > transid failure", but could be any of the related errors, including > "fsid mismatch" or "bad tree block", depending on which block got > allocated. > > The modification of the individual log root items are serialized on the > per-log root root_mutex. This means that any modification to the > per-subvol log root_item is completely protected. > > However we update the root item in the log root tree outside of the log > root tree log_mutex. We do this in order to allow multiple subvolumes > to be updated in each log transaction. > > This is problematic however because when we are writing the log root > tree out we update the super block with the _current_ log root node > information. Since these two operations happen independently of each > other, you can end up updating the log root tree in between writing out > the dirty blocks and setting the super block to point at the current > root. > > This means we'll point at the new root node that hasn't been written > out, instead of the one we should be pointing at. Thus whatever garbage > or old block we end up pointing at complains when we mount the file > system later and try to replay the log. > > Fix this by copying the log's root item into a local root item copy. > Then once we're safely under the log_root_tree->log_mutex we update the > root item in the log_root_tree. This way we do not modify the > log_root_tree while we're committing it, fixing the problem. > > cc: stable@xxxxxxxxxxxxxxx > Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx> > Reviewed-by: Chris Mason <clm@xxxxxx> Reviewed-by: Filipe Manana <fdmanana@xxxxxxxx> Looks good to me, great catch! Thanks. > --- > fs/btrfs/tree-log.c | 36 +++++++++++++++++++++++++++--------- > 1 file changed, 27 insertions(+), 9 deletions(-) > > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c > index 7cac09a6f007..1d7f22951ef2 100644 > --- a/fs/btrfs/tree-log.c > +++ b/fs/btrfs/tree-log.c > @@ -2908,7 +2908,8 @@ static int walk_log_tree(struct btrfs_trans_handle *trans, > * in the tree of log roots > */ > static int update_log_root(struct btrfs_trans_handle *trans, > - struct btrfs_root *log) > + struct btrfs_root *log, > + struct btrfs_root_item *root_item) > { > struct btrfs_fs_info *fs_info = log->fs_info; > int ret; > @@ -2916,10 +2917,10 @@ static int update_log_root(struct btrfs_trans_handle *trans, > if (log->log_transid == 1) { > /* insert root item on the first sync */ > ret = btrfs_insert_root(trans, fs_info->log_root_tree, > - &log->root_key, &log->root_item); > + &log->root_key, root_item); > } else { > ret = btrfs_update_root(trans, fs_info->log_root_tree, > - &log->root_key, &log->root_item); > + &log->root_key, root_item); > } > return ret; > } > @@ -3017,6 +3018,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans, > struct btrfs_fs_info *fs_info = root->fs_info; > struct btrfs_root *log = root->log_root; > struct btrfs_root *log_root_tree = fs_info->log_root_tree; > + struct btrfs_root_item new_root_item; > int log_transid = 0; > struct btrfs_log_ctx root_log_ctx; > struct blk_plug plug; > @@ -3080,17 +3082,25 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans, > goto out; > } > > + /* > + * We _must_ update under the root->log_mutex in order to make sure we > + * have a consistent view of the log root we are trying to commit at > + * this moment. > + * > + * We _must_ copy this into a local copy, because we are not holding the > + * log_root_tree->log_mutex yet. This is important because when we > + * commit the log_root_tree we must have a consistent view of the > + * log_root_tree when we update the super block to point at the > + * log_root_tree bytenr. If we update the log_root_tree here we'll race > + * with the commit and possibly point at the new block which we may not > + * have written out. > + */ > btrfs_set_root_node(&log->root_item, log->node); > + memcpy(&new_root_item, &log->root_item, sizeof(new_root_item)); > > root->log_transid++; > log->log_transid = root->log_transid; > root->log_start_pid = 0; > - /* > - * Update or create log root item under the root's log_mutex to prevent > - * races with concurrent log syncs that can lead to failure to update > - * log root item because it was not created yet. > - */ > - ret = update_log_root(trans, log); > /* > * IO has been started, blocks of the log tree have WRITTEN flag set > * in their headers. new modifications of the log will be written to > @@ -3111,6 +3121,14 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans, > mutex_unlock(&log_root_tree->log_mutex); > > mutex_lock(&log_root_tree->log_mutex); > + > + /* > + * Now we are safe to update the log_root_tree because we're under the > + * log_mutex, and we're a current writer so we're holding the commit > + * open until we drop the log_mutex. > + */ > + ret = update_log_root(trans, log, &new_root_item); > + > if (atomic_dec_and_test(&log_root_tree->log_writers)) { > /* atomic_dec_and_test implies a barrier */ > cond_wake_up_nomb(&log_root_tree->log_writer_wait); > -- > 2.21.0 > -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.”