Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



There might be two reasons:

1) You are still testing a multiple reply-queue device?

As before, I am testing by exposing mutliple queues to the SCSI midlayer. I had to make this change locally, as on mainline we still only expose a single queue and use the internal reply queue when enabling managed interrupts.

As I
mentioned last times, it is hard to map reply-queue into blk-mq
hctx correctly.

Here's my branch, if you want to check:

https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.4-mq-v4

It's a bit messy (sorry), but you can see that the reply-queue in the LLDD is removed in commit 087b95af374.

I am now thinking of actually making this change to the LLDD in mainline to avoid any doubt in future.


2) concurrent dispatch to device, which can be observed by the
following patch.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 06081966549f..3590f6f947eb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -679,6 +679,8 @@ void blk_mq_start_request(struct request *rq)
 {
        struct request_queue *q = rq->q;

+       WARN_ON_ONCE(test_bit(BLK_MQ_S_INTERNAL_STOPPED, &rq->mq_hctx->state));
+
        trace_block_rq_issue(q, rq);

        if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {

However, I think it is hard to be 2#, since the current CPU is the last
CPU in hctx->cpu_mask.


I'll try it.


Hi Ming,

I am looking at this issue again.

I am using https://lore.kernel.org/linux-scsi/1571926881-75524-1-git-send-email-john.garry@xxxxxxxxxx/T/#t with expose_mq_experimental set. I guess you're going to say that this series is wrong, but I think it's ok for this purpose.

Forgetting that for a moment, maybe I have found an issue.

For the SCSI commands which timeout, I notice that scsi_set_blocked(reason=SCSI_MLQUEUE_EH_RETRY) was called 30 seconds earlier.

 scsi_set_blocked+0x20/0xb8
 __scsi_queue_insert+0x40/0x90
 scsi_softirq_done+0x164/0x1c8
 __blk_mq_complete_request_remote+0x18/0x20
 flush_smp_call_function_queue+0xa8/0x150
 generic_smp_call_function_single_interrupt+0x10/0x18
 handle_IPI+0xec/0x1a8
 arch_cpu_idle+0x10/0x18
 do_idle+0x1d0/0x2b0
 cpu_startup_entry+0x24/0x40
 secondary_start_kernel+0x1b4/0x208

I also notice that the __scsi_queue_insert() call, above, seems to retry to requeue the request on a dead rq in calling __scsi_queue_insert()->blk_mq_requeue_requet()->__blk_mq_requeue_request(), ***:

[ 1185.235243] psci: CPU1 killed.
[ 1185.238610] blk_mq_hctx_notify_dead cpu1 dead request_queue=0xffff0023ace24f60 (id=19) [ 1185.246530] blk_mq_hctx_notify_dead cpu1 dead request_queue=0xffff0023ace23f80 (id=17) [ 1185.254443] blk_mq_hctx_notify_dead cpu1 dead request_queue=0xffff0023ace22fa0 (id=15) [ 1185.262356] blk_mq_hctx_notify_dead cpu1 dead request_queue=0xffff0023ace21fc0 (id=13)*** [ 1185.270271] blk_mq_hctx_notify_dead cpu1 dead request_queue=0xffff0023ace20fe0 (id=11)
[ 1185.939451] scsi_softirq_done NEEDS_RETRY rq=0xffff0023b7416000
[ 1185.945359] scsi_set_blocked reason=0x1057
[ 1185.949444] __blk_mq_requeue_request request_queue=0xffff0023ace21fc0 id=13 rq=0xffff0023b7416000***

[...]

[ 1214.903455] scsi_timeout req=0xffff0023add29000 reserved=0
[ 1214.908946] scsi_timeout req=0xffff0023add29300 reserved=0
[ 1214.914424] scsi_timeout req=0xffff0023add29600 reserved=0
[ 1214.919909] scsi_timeout req=0xffff0023add29900 reserved=0

I guess that we're retrying as the SCSI failed in the LLDD for some reason.

So could this be the problem - we're attempting to requeue on a dead request queue?

Thanks,
John

Thanks as always,
John


Thanks,
Ming




[Index of Archives]     [Linux RAID]     [Linux SCSI]     [Linux ATA RAID]     [IDE]     [Linux Wireless]     [Linux Kernel]     [ATH6KL]     [Linux Bluetooth]     [Linux Netdev]     [Kernel Newbies]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Device Mapper]

  Powered by Linux