Hi Keith and Ming,

On 04/07/2019 05:27 AM, Ming Lei wrote:
> On Fri, Apr 05, 2019 at 05:36:32PM -0600, Keith Busch wrote:
>> On Fri, Apr 5, 2019 at 5:04 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>> Looking at current peak testing, I've got around 1.2% in queue enter
>>> and exit. It's definitely not free, hence my question. Probably safe
>>> to assume that we'll double that cycle counter, per IO.
>>
>> Okay, that's not negligible at all. I don't know of a faster reference
>> than the percpu_ref, but that much overhead would have to rule out
>> having a per hctx counter.
>
> Or not using any refcount in fast path, how about the following one?
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3ff3d7b49969..6fe334e12236 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2199,6 +2199,23 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>  	return -ENOMEM;
>  }
>
> +static void blk_mq_wait_hctx_become_idle(struct blk_mq_hw_ctx *hctx,
> +		int dead_cpu)
> +{
> +	unsigned long msecs_left = 1000 * 10;
> +
> +	while (msecs_left > 0) {
> +		if (blk_mq_hctx_idle(hctx))
> +			break;
> +		msleep(5);
> +		msecs_left -= 5;
> +	}
> +
> +	if (!msecs_left)
> +		printk(KERN_WARNING "requests not completed from "
> +				"CPU %d\n", dead_cpu);
> +}
> +
>  /*
>   * 'cpu' is going away. splice any existing rq_list entries from this
>   * software queue to the hw queue dispatch list, and ensure that it
> @@ -2230,6 +2247,14 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
>  	spin_unlock(&hctx->lock);
>

(Side note on the sketch above: the warning should fire when the wait times
out, i.e. when msecs_left has reached zero, so I flipped the condition to
"if (!msecs_left)" here; the original "if (msecs_left > 0)" would warn in
exactly the case where the hctx did become idle.)

While debugging blk_mq_hctx_notify_dead(), I found that it is called once
for each hctx when a cpu goes offline. As shown below between lines 2216
and 2217, the ctx might not be mapped to the hctx at all. Why would we
flush that ctx->rq_lists[type] to hctx->dispatch?

2207 static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
2208 {
2209 	struct blk_mq_hw_ctx *hctx;
2210 	struct blk_mq_ctx *ctx;
2211 	LIST_HEAD(tmp);
2212 	enum hctx_type type;
2213
2214 	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
2215 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
2216 	type = hctx->type;

===>>>>> ctx might not be mapped to hctx here, that is,
===>>>>> cpumask_test_cpu(cpu, hctx->cpumask) == false

2217
2218 	spin_lock(&ctx->lock);
2219 	if (!list_empty(&ctx->rq_lists[type])) {
2220 		list_splice_init(&ctx->rq_lists[type], &tmp);
2221 		blk_mq_hctx_clear_pending(hctx, ctx);
2222 	}
2223 	spin_unlock(&ctx->lock);
2224
2225 	if (list_empty(&tmp))
2226 		return 0;
2227
2228 	spin_lock(&hctx->lock);
2229 	list_splice_tail_init(&tmp, &hctx->dispatch);
2230 	spin_unlock(&hctx->lock);
2231
2232 	blk_mq_run_hw_queue(hctx, true);
2233 	return 0;
2234 }

For example, on a VM (with nvme) with 4 cpus, offlining cpu 2 of the 4
(0-3) makes blk_mq_hctx_notify_dead() run once for each io queue hctx:

1st: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 3
2nd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 2
3rd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 1
4th: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 0

Although blk_mq_ctx->cpu = 2 is mapped only to blk_mq_hw_ctx->queue_num = 2
in this case, its ctx->rq_lists[type] will nevertheless be moved to
blk_mq_hw_ctx->queue_num = 3 during the 1st call of
blk_mq_hctx_notify_dead().

Is this expected or a bug?
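For context on why the callback fires once per hctx: every hctx registers
its own instance on the CPUHP_BLK_MQ_DEAD cpuhp state when it is set up. A
minimal excerpt of the registration, as I read it in blk_mq_init_hctx()
(surrounding code trimmed; details may vary by kernel version):

	/*
	 * Each hw queue hooks its own hlist_node onto the
	 * CPUHP_BLK_MQ_DEAD state, so offlining one cpu invokes
	 * blk_mq_hctx_notify_dead() once per hctx, whether or not
	 * that cpu is in hctx->cpumask.
	 */
	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD,
					 &hctx->cpuhp_dead);

If it is a bug, would the below help?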
[PATCH 1/1] blk-mq: do not splice ctx->rq_lists[type] to hctx->dispatch if ctx is not mapped to hctx

When a cpu is offlined, blk_mq_hctx_notify_dead() is called once for each
hctx for the offlined cpu. It splices all ctx->rq_lists[type] entries to
hctx->dispatch, but never checks whether the ctx is actually mapped to the
hctx.

For example, on a VM (with nvme) with 4 cpus, offlining cpu 2 of the 4
(0-3) makes blk_mq_hctx_notify_dead() run once for each io queue hctx:

1st: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 3
2nd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 2
3rd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 1
4th: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 0

Although blk_mq_ctx->cpu = 2 is mapped only to blk_mq_hw_ctx->queue_num = 2
in this case, its ctx->rq_lists[type] will nevertheless be moved to
blk_mq_hw_ctx->queue_num = 3 during the 1st call of
blk_mq_hctx_notify_dead().

Return early from blk_mq_hctx_notify_dead() when the ctx is not mapped to
the hctx, and let the subsequent call for the correctly mapped hctx do the
splicing.

Signed-off-by: Dongli Zhang <dongli.zhang@xxxxxxxxxx>
---
 block/blk-mq.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3ff3d7b..b8ef489 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2212,6 +2212,10 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	enum hctx_type type;
 
 	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
+
+	if (!cpumask_test_cpu(cpu, hctx->cpumask))
+		return 0;
+
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
 	type = hctx->type;
-- 
2.7.4

Thank you very much!

Dongli Zhang
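P.S. For review convenience, this is how blk_mq_hctx_notify_dead() would
read with the proposed check applied (just the quoted base above plus the
one hunk, not separately tested):

static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
{
	struct blk_mq_hw_ctx *hctx;
	struct blk_mq_ctx *ctx;
	LIST_HEAD(tmp);
	enum hctx_type type;

	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);

	/* proposed check: skip hw queues this cpu is not mapped to */
	if (!cpumask_test_cpu(cpu, hctx->cpumask))
		return 0;

	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
	type = hctx->type;

	/* move any requests still queued on this cpu's software queue */
	spin_lock(&ctx->lock);
	if (!list_empty(&ctx->rq_lists[type])) {
		list_splice_init(&ctx->rq_lists[type], &tmp);
		blk_mq_hctx_clear_pending(hctx, ctx);
	}
	spin_unlock(&ctx->lock);

	if (list_empty(&tmp))
		return 0;

	/* hand them to the hw queue dispatch list and rerun the queue */
	spin_lock(&hctx->lock);
	list_splice_tail_init(&tmp, &hctx->dispatch);
	spin_unlock(&hctx->lock);

	blk_mq_run_hw_queue(hctx, true);
	return 0;
}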