Re: [PATCH] block: Fix a race between tag iteration and hardware queue changes

Hi Bart,

On 4/3/19 6:32 AM, Bart Van Assche wrote:
> Since the callback function called by blk_mq_queue_tag_busy_iter()
> may sleep, calling synchronize_rcu() from __blk_mq_update_nr_hw_queues()
> is not sufficient to wait until blk_mq_queue_tag_busy_iter() has
> finished. Instead of making __blk_mq_update_nr_hw_queues() wait until
> q->q_usage_counter == 0 is globally visible, do not iterate over tags
> if the request queue is frozen.
>
> Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx>
> Cc: Hannes Reinecke <hare@xxxxxxxx>
> Cc: James Smart <james.smart@xxxxxxxxxxxx>
> Cc: Ming Lei <ming.lei@xxxxxxxxxx>
> Cc: Jianchao Wang <jianchao.w.wang@xxxxxxxxxx>
> Cc: Keith Busch <keith.busch@xxxxxxxxx>
> Cc: Dongli Zhang <dongli.zhang@xxxxxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx>
> Fixes: 530ca2c9bd69 ("blk-mq: Allow blocking queue tag iter callbacks") # v4.19.
> Signed-off-by: Bart Van Assche <bvanassche@xxxxxxx>
> ---
>  block/blk-mq-tag.c | 10 +++++-----
>  block/blk-mq.c     |  5 +----
>  2 files changed, 6 insertions(+), 9 deletions(-)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index a4931fc7be8a..89f479634b4d 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -383,14 +383,13 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
>  
>  	/*
>  	 * __blk_mq_update_nr_hw_queues() updates nr_hw_queues and queue_hw_ctx
> -	 * while the queue is frozen. So we can use q_usage_counter to avoid
> -	 * racing with it. __blk_mq_update_nr_hw_queues() uses
> -	 * synchronize_rcu() to ensure this function left the critical section
> -	 * below.
> +	 * while the queue is frozen. Hold q_usage_counter to serialize
> +	 * __blk_mq_update_nr_hw_queues() against this function.
>  	 */
>  	if (!percpu_ref_tryget(&q->q_usage_counter))
>  		return;
> -
> +	if (atomic_read(&q->mq_freeze_depth))
> +		goto out;
>  	queue_for_each_hw_ctx(q, hctx, i) {
>  		struct blk_mq_tags *tags = hctx->tags;
>  
> @@ -405,6 +404,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
>  			bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
>  		bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
>  	}
> +out:
>  	blk_queue_exit(q);

If the callback function goes to sleep, we will not reach blk_queue_exit()
(which decrements q->q_usage_counter) until the callback returns.

On the __blk_mq_update_nr_hw_queues() side, once blk_mq_freeze_queue() has
returned, __PERCPU_REF_DEAD is set and q->q_usage_counter should already be 0;
otherwise blk_mq_freeze_queue_wait() would not have returned.

I think we would not even be able to invoke the callback (which might sleep)
once blk_mq_freeze_queue() has completed?

Why do we need the extra check of q->mq_freeze_depth? Is it an optimization?

Please let me know if my understanding is incorrect.

Thank you very much!

>  }
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 652d0c6d5945..f9fc1536408d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3226,10 +3226,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>  
>  	list_for_each_entry(q, &set->tag_list, tag_set_list)
>  		blk_mq_freeze_queue(q);
> -	/*
> -	 * Sync with blk_mq_queue_tag_busy_iter.
> -	 */
> -	synchronize_rcu();
> +
>  	/*
>  	 * Switch IO scheduler to 'none', cleaning up the data associated
>  	 * with the previous scheduler. We will switch back once we are done
> 

Dongli Zhang


