Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug

On 20/10/2019 11:14, Ming Lei wrote:
> > > [...] right? If so, I need to find some simple sysfs entry to
> > > tell me of this occurrence, to trigger the capture. Or add something. My
> > > script is pretty dumb.
> > >
> > > BTW, I did notice that we hit the dump_stack() in __blk_mq_run_hw_queue()
> > > pretty soon before the problem happens - maybe a clue or maybe coincidence.
> > >
> >
> > I finally got to capture that debugfs dump at the point the SCSI IOs
> > timeout, as attached. Let me know if there is any problem receiving it.
> >
> > Here's a kernel log snippet at that point (I added some prints for the
> > timeout):
> >
> > 609] psci: CPU6 killed.
> > [  547.722217] CPU5: shutdown
> > [  547.724926] psci: CPU5 killed.
> > [  547.749951] irq_shutdown
> > [  547.752701] IRQ 800: no longer affine to CPU4
> > [  547.757265] CPU4: shutdown
> > [  547.759971] psci: CPU4 killed.
> > [  547.790348] CPU3: shutdown
> > [  547.793052] psci: CPU3 killed.
> > [  547.818330] CPU2: shutdown
> > [  547.821033] psci: CPU2 killed.
> > [  547.854285] CPU1: shutdown
> > [  547.856989] psci: CPU1 killed.
> > [  575.925307] scsi_timeout req=0xffff0023b0dd9c00 reserved=0
> > [  575.930794] scsi_timeout req=0xffff0023b0df2700 reserved=0
> From the debugfs log, 66 requests are dumped; 63 of them have
> been submitted to the device, and the other 3 are in the ->dispatch list
> via requeue after the timeout is handled.
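
As an aside, a rough way to reproduce this kind of started/not-started
breakdown from core block code is blk_mq_tagset_busy_iter(), which walks
the allocated requests of a tag set; count_rq() and the counts[] layout
below are illustrative only, not from the thread:

static bool count_rq(struct request *rq, void *data, bool reserved)
{
	unsigned int *counts = data;

	if (blk_mq_request_started(rq))
		counts[0]++;	/* issued to the device */
	else
		counts[1]++;	/* allocated but not started on the device */

	return true;		/* keep iterating */
}

/*
 * Usage, e.g. against a SCSI host's tag set:
 *	unsigned int counts[2] = { 0, 0 };
 *	blk_mq_tagset_busy_iter(&shost->tag_set, count_rq, counts);
 */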


Hi Ming,

> You mentioned that:
>
> " - I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when
>  inflight rqs != 0, and I don't see them for this timeout"
>
> There might be two reasons:
>
> 1) You are still testing a multiple reply-queue device?

As before, I am testing by exposing multiple queues to the SCSI midlayer. I had to make this change locally, as in mainline we still expose only a single queue and use the internal reply queues when managed interrupts are enabled.

> As I mentioned last time, it is hard to map reply-queues onto blk-mq
> hctxs correctly.

Here's my branch, if you want to check:

https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.4-mq-v4

It's a bit messy (sorry), but you can see that the reply-queue handling in the LLDD is removed in commit 087b95af374.

I am now thinking of actually making this change to the LLDD in mainline to avoid any doubt in future.
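
For reference, a minimal sketch of what that change looks like on the
LLDD side, assuming a PCI device; the my_* names are illustrative, while
->map_queues, shost->nr_hw_queues and blk_mq_pci_map_queues() are the
real v5.4-era interfaces:

#include <scsi/scsi_host.h>
#include <linux/blk-mq-pci.h>

struct my_hba {
	struct pci_dev *pdev;
	unsigned int nr_reply_queues;
};

static int my_map_queues(struct Scsi_Host *shost)
{
	struct my_hba *hba = shost_priv(shost);

	/* Derive the hctx <-> CPU mapping from the managed-IRQ affinity. */
	return blk_mq_pci_map_queues(&shost->tag_set.map[HCTX_TYPE_DEFAULT],
				     hba->pdev, 0);
}

static struct scsi_host_template my_sht = {
	.name		= "my_lldd",
	.map_queues	= my_map_queues,
	/* ... */
};

static void my_init_queues(struct Scsi_Host *shost, struct my_hba *hba)
{
	/* Expose every reply queue as a blk-mq hw queue (set before
	 * scsi_add_host()), rather than a single queue plus internal
	 * reply-queue steering. */
	shost->nr_hw_queues = hba->nr_reply_queues;
}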


> 2) concurrent dispatch to the device, which can be observed with the
> following patch:
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 06081966549f..3590f6f947eb 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -679,6 +679,8 @@ void blk_mq_start_request(struct request *rq)
>  {
>         struct request_queue *q = rq->q;
>
> +       WARN_ON_ONCE(test_bit(BLK_MQ_S_INTERNAL_STOPPED, &rq->mq_hctx->state));
> +
>         trace_block_rq_issue(q, rq);
>
>         if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
>
> However, I think it is unlikely to be #2, since the current CPU is the
> last online CPU in hctx->cpumask.
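
For reference, the "last CPU" condition relied on there is roughly the
following; a sketch built on the standard cpumask helpers, not code
copied from the series:

#include <linux/blk-mq.h>
#include <linux/cpumask.h>

static bool last_online_cpu_in_hctx(unsigned int cpu,
				    struct blk_mq_hw_ctx *hctx)
{
	/* @cpu is the first online CPU in the hctx's cpumask ... */
	return cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) == cpu &&
	       /* ... and no further online CPU follows it. */
	       cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) >=
	       nr_cpu_ids;
}

If that holds while the hctx is being drained, no other online CPU should
be able to call blk_mq_start_request() on it, which matches the
expectation above that #2 is unlikely.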


I'll try it.

Thanks as always,
John


> Thanks,
> Ming

