Hi Keith and Ming,

On 04/07/2019 05:27 AM, Ming Lei wrote:
> On Fri, Apr 05, 2019 at 05:36:32PM -0600, Keith Busch wrote:
>> On Fri, Apr 5, 2019 at 5:04 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>> Looking at current peak testing, I've got around 1.2% in queue enter
>>> and exit. It's definitely not free, hence my question. Probably safe
>>> to assume that we'll double that cycle counter, per IO.
>>
>> Okay, that's not negligible at all. I don't know of a faster reference
>> than the percpu_ref, but that much overhead would have to rule out
>> having a per hctx counter.
>
> Or not using any refcount in fast path, how about the following one?
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3ff3d7b49969..6fe334e12236 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2199,6 +2199,23 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>  	return -ENOMEM;
>  }
>
> +static void blk_mq_wait_hctx_become_idle(struct blk_mq_hw_ctx *hctx,
> +		int dead_cpu)
> +{
> +	unsigned long msecs_left = 1000 * 10;
> +
> +	while (msecs_left > 0) {
> +		if (blk_mq_hctx_idle(hctx))
> +			break;
> +		msleep(5);
> +		msecs_left -= 5;
> +	}
> +
> +	if (!msecs_left)
> +		printk(KERN_WARNING "requests not completed from "
> +				"CPU %d\n", dead_cpu);
> +}
> +
>  /*
>   * 'cpu' is going away. splice any existing rq_list entries from this
>   * software queue to the hw queue dispatch list, and ensure that it
> @@ -2230,6 +2247,14 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
>  	spin_unlock(&hctx->lock);
>

(Side note on the sketch above: the warning should fire when the wait times
out, i.e. when msecs_left has reached zero, so I flipped the condition to
"if (!msecs_left)" here; the original "if (msecs_left > 0)" would warn in
exactly the case where the hctx did become idle.)

While debugging blk_mq_hctx_notify_dead(), I found that it is called once
for each hctx when a cpu goes offline. As shown below between lines 2216
and 2217, the ctx might not be mapped to the hctx at all. Why would we
flush that ctx->rq_lists[type] to hctx->dispatch?

2207 static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
2208 {
2209 	struct blk_mq_hw_ctx *hctx;
2210 	struct blk_mq_ctx *ctx;
2211 	LIST_HEAD(tmp);
2212 	enum hctx_type type;
2213
2214 	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
2215 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
2216 	type = hctx->type;

===>>>>> ctx might not be mapped to hctx here, that is,
===>>>>> cpumask_test_cpu(cpu, hctx->cpumask) == false

2217
2218 	spin_lock(&ctx->lock);
2219 	if (!list_empty(&ctx->rq_lists[type])) {
2220 		list_splice_init(&ctx->rq_lists[type], &tmp);
2221 		blk_mq_hctx_clear_pending(hctx, ctx);
2222 	}
2223 	spin_unlock(&ctx->lock);
2224
2225 	if (list_empty(&tmp))
2226 		return 0;
2227
2228 	spin_lock(&hctx->lock);
2229 	list_splice_tail_init(&tmp, &hctx->dispatch);
2230 	spin_unlock(&hctx->lock);
2231
2232 	blk_mq_run_hw_queue(hctx, true);
2233 	return 0;
2234 }

For example, on a VM (with nvme) with 4 cpus, offlining cpu 2 of the 4
(0-3) makes blk_mq_hctx_notify_dead() run once for each io queue hctx:

1st: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 3
2nd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 2
3rd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 1
4th: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 0

Although blk_mq_ctx->cpu = 2 is mapped only to blk_mq_hw_ctx->queue_num = 2
in this case, its ctx->rq_lists[type] will nevertheless be moved to
blk_mq_hw_ctx->queue_num = 3 during the 1st call of
blk_mq_hctx_notify_dead().

Is this expected or a bug?
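For context on why the callback fires once per hctx: every hctx registers
its own instance on the CPUHP_BLK_MQ_DEAD cpuhp state when it is set up. A
minimal excerpt of the registration, as I read it in blk_mq_init_hctx()
(surrounding code trimmed; details may vary by kernel version):

	/*
	 * Each hw queue hooks its own hlist_node onto the
	 * CPUHP_BLK_MQ_DEAD state, so offlining one cpu invokes
	 * blk_mq_hctx_notify_dead() once per hctx, whether or not
	 * that cpu is in hctx->cpumask.
	 */
	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD,
					 &hctx->cpuhp_dead);

If it is a bug, would the below help?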
[PATCH 1/1] blk-mq: do not splice ctx->rq_lists[type] to hctx->dispatch if ctx is not mapped to hctx

When a cpu is offlined, blk_mq_hctx_notify_dead() is called once for each
hctx for the offlined cpu. It splices all ctx->rq_lists[type] entries to
hctx->dispatch, but never checks whether the ctx is actually mapped to the
hctx.

For example, on a VM (with nvme) with 4 cpus, offlining cpu 2 of the 4
(0-3) makes blk_mq_hctx_notify_dead() run once for each io queue hctx:

1st: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 3
2nd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 2
3rd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 1
4th: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 0

Although blk_mq_ctx->cpu = 2 is mapped only to blk_mq_hw_ctx->queue_num = 2
in this case, its ctx->rq_lists[type] will nevertheless be moved to
blk_mq_hw_ctx->queue_num = 3 during the 1st call of
blk_mq_hctx_notify_dead().

Return early from blk_mq_hctx_notify_dead() when the ctx is not mapped to
the hctx, and let the subsequent call for the correctly mapped hctx do the
splicing.

Signed-off-by: Dongli Zhang <dongli.zhang@xxxxxxxxxx>
---
 block/blk-mq.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3ff3d7b..b8ef489 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2212,6 +2212,10 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	enum hctx_type type;
 
 	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
+
+	if (!cpumask_test_cpu(cpu, hctx->cpumask))
+		return 0;
+
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
 	type = hctx->type;
-- 
2.7.4

Thank you very much!

Dongli Zhang
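P.S. For review convenience, this is how blk_mq_hctx_notify_dead() would
read with the proposed check applied (just the quoted base above plus the
one hunk, not separately tested):

static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
{
	struct blk_mq_hw_ctx *hctx;
	struct blk_mq_ctx *ctx;
	LIST_HEAD(tmp);
	enum hctx_type type;

	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);

	/* proposed check: skip hw queues this cpu is not mapped to */
	if (!cpumask_test_cpu(cpu, hctx->cpumask))
		return 0;

	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
	type = hctx->type;

	/* move any requests still queued on this cpu's software queue */
	spin_lock(&ctx->lock);
	if (!list_empty(&ctx->rq_lists[type])) {
		list_splice_init(&ctx->rq_lists[type], &tmp);
		blk_mq_hctx_clear_pending(hctx, ctx);
	}
	spin_unlock(&ctx->lock);

	if (list_empty(&tmp))
		return 0;

	/* hand them to the hw queue dispatch list and rerun the queue */
	spin_lock(&hctx->lock);
	list_splice_tail_init(&tmp, &hctx->dispatch);
	spin_unlock(&hctx->lock);

	blk_mq_run_hw_queue(hctx, true);
	return 0;
}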