On Thu, Oct 17, 2019 at 04:40:12PM +0100, John Garry wrote:
> On 16/10/2019 17:19, John Garry wrote:
> > On 16/10/2019 13:07, Ming Lei wrote:
> > > On Wed, Oct 16, 2019 at 09:58:27AM +0100, John Garry wrote:
> > > > On 14/10/2019 02:50, Ming Lei wrote:
> > > > > Hi,
> > > > >
> > > > > Thomas mentioned:
> > > > > "
> > > > >  That was the constraint of managed interrupts from the very
> > > > >  beginning:
> > > > >
> > > > >   The driver/subsystem has to quiesce the interrupt line and the
> > > > >   associated queue _before_ it gets shutdown in CPU unplug and not
> > > > >   fiddle with it until it's restarted by the core when the CPU is
> > > > >   plugged in again.
> > > > > "
> > > > >
> > > > > But no drivers or blk-mq do that before one hctx becomes dead (all
> > > > > CPUs for one hctx are offline), and even worse, blk-mq still tries
> > > > > to run the hw queue after the hctx is dead, see
> > > > > blk_mq_hctx_notify_dead().
> > > > >
> > > > > This patchset tries to address the issue in two stages:
> > > > >
> > > > > 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> > > > >
> > > > > - mark the hctx as internally stopped, and drain all in-flight
> > > > >   requests if the hctx is going to be dead.
> > > > >
> > > > > 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx
> > > > >    becomes dead
> > > > >
> > > > > - steal bios from the request, and resubmit them via
> > > > >   generic_make_request(), then these IOs will be mapped to other
> > > > >   live hctxs for dispatch
> > > > >
> > > > > Please comment & review, thanks!
> > > > >
> > > > > John, I don't add your tested-by tag since V3 has some changes,
> > > > > and I would appreciate it if you could run your test on V3.
> > > >
> > > > Hi Ming,
> > > >
> > > > So I got around to doing some testing. The good news is that the
> > > > issue which we were experiencing in the v3 series seems to have gone
> > > > away - it is a lot more stable.
> > > >
> > > > However, unfortunately, I did notice some SCSI timeouts:
> > > >
> > > > [15508.615074] CPU2: shutdown
> > > > [15508.617778] psci: CPU2 killed.
> > > > [15508.651220] CPU1: shutdown
> > > > [15508.653924] psci: CPU1 killed.
> > > > [15518.406229] sas: Enter sas_scsi_recover_host busy: 63 failed: 63
> > > > Jobs: 1 (f=1): [R] [0.0% done] [0[15518.412239] sas: sas_scsi_find_task:
> > > > aborting task 0x00000000a7159744
> > > > KB/0KB/0KB /s] [0/0/0 iops] [eta [15518.421708] sas:
> > > > sas_eh_handle_sas_errors: task 0x00000000a7159744 is done
> > > > [15518.431266] sas: sas_scsi_find_task: aborting task 0x00000000d39731eb
> > > > [15518.442539] sas: sas_eh_handle_sas_errors: task 0x00000000d39731eb is done
> > > > [15518.449407] sas: sas_scsi_find_task: aborting task 0x000000009f77c9bd
> > > > [15518.455899] sas: sas_eh_handle_sas_errors: task 0x000000009f77c9bd is done
> > > >
> > > > A couple of things to note:
> > > > - I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for
> > > >   when inflight rqs != 0, and I don't see them for this timeout
> > > > - 0 datarate reported from fio
> > > >
> > > > I'll have a look...
> > >
> > > What is the output of the following command?
> > >
> > > (cd /sys/kernel/debug/block/$SAS_DISK && find . -type f -exec grep -aH . {} \;)
> >
> > I assume that you want this run at about the time SCSI EH kicks in for
> > the timeout, right? If so, I need to find some simple sysfs entry to
> > tell me of this occurrence, to trigger the capture. Or add something.
> > My script is pretty dumb.
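
Stage 1 of the approach quoted above boils down to a teardown callback on
a new multi-instance cpuhp state, which runs while the outgoing CPU is
still online. Below is a minimal sketch of how that wiring typically
looks; it assumes a per-hctx hlist_node named cpuhp_online and a
blk_mq_hctx_drain_inflight_rqs(hctx) signature, so treat the names and
details as illustrative rather than the actual patch.

#include <linux/blk-mq.h>
#include <linux/cpuhotplug.h>

/*
 * Sketch only: register a multi-instance cpuhp state so each hctx gets
 * an offline callback before its CPU is shut down, then mark the hctx
 * internally stopped and drain whatever is already in flight.
 */
static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
{
	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
			struct blk_mq_hw_ctx, cpuhp_online);

	/*
	 * The actual series would only do this when the hctx is about to
	 * lose its last online CPU (a check like the one sketched at the
	 * end of this mail): stop new dispatch on this hctx and wait for
	 * already-issued requests to complete.
	 */
	set_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
	blk_mq_hctx_drain_inflight_rqs(hctx);

	return 0;
}

/* Per-hctx registration, e.g. during hctx init */
static int blk_mq_hctx_cpuhp_add(struct blk_mq_hw_ctx *hctx)
{
	return cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
						&hctx->cpuhp_online);
}

/* One-time setup of the cpuhp state itself */
static int __init blk_mq_cpuhp_setup(void)
{
	return cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE,
				       "block/mq:online", NULL,
				       blk_mq_hctx_notify_offline);
}

The key property is that the callback runs before the CPU and its managed
IRQ are actually shut down, which is the constraint Thomas describes in
the quote above.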
> >
> > BTW, I did notice that we hit the dump_stack in __blk_mq_run_hw_queue()
> > pretty soon before the problem happens - maybe a clue or maybe
> > coincidence.
>
> I finally got to capture that debugfs dump at the point the SCSI IOs
> time out, as attached. Let me know if there is any problem receiving it.
>
> Here's a kernel log snippet at that point (I added some print for the
> timeout):
>
> 609] psci: CPU6 killed.
> [ 547.722217] CPU5: shutdown
> [ 547.724926] psci: CPU5 killed.
> [ 547.749951] irq_shutdown
> [ 547.752701] IRQ 800: no longer affine to CPU4
> [ 547.757265] CPU4: shutdown
> [ 547.759971] psci: CPU4 killed.
> [ 547.790348] CPU3: shutdown
> [ 547.793052] psci: CPU3 killed.
> [ 547.818330] CPU2: shutdown
> [ 547.821033] psci: CPU2 killed.
> [ 547.854285] CPU1: shutdown
> [ 547.856989] psci: CPU1 killed.
> [ 575.925307] scsi_timeout req=0xffff0023b0dd9c00 reserved=0
> [ 575.930794] scsi_timeout req=0xffff0023b0df2700 reserved=0

From the debugfs log, 66 requests are dumped; 63 of them have been
submitted to the device, and the other 3 are in the ->dispatch list via
requeue after the timeout is handled.

You mentioned that:

"
- I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when
  inflight rqs != 0, and I don't see them for this timeout
"

There might be two reasons:

1) You are still testing a multiple reply-queue device? As I mentioned
last time, it is hard to map reply-queues onto blk-mq hctxs correctly.

2) Concurrent dispatch to the device, which can be observed with the
following patch.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 06081966549f..3590f6f947eb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -679,6 +679,8 @@ void blk_mq_start_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 
+	WARN_ON_ONCE(test_bit(BLK_MQ_S_INTERNAL_STOPPED, &rq->mq_hctx->state));
+
 	trace_block_rq_issue(q, rq);
 
 	if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {

However, I think #2 is unlikely, since the current CPU is the last CPU
in hctx->cpu_mask.

Thanks,
Ming
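
To spell out that last point: the offline hook only needs to act when the
CPU going down is the last online CPU mapped to the hctx; otherwise
another online CPU still serves the hctx and nothing has to be drained.
A sketch of what such a check can look like (the helper name is made up
for illustration and is not taken from the patchset):

#include <linux/blk-mq.h>
#include <linux/cpumask.h>

/*
 * Illustrative helper: true if 'cpu' is mapped to this hctx and every
 * other CPU in hctx->cpumask is already offline, i.e. the hctx is about
 * to lose its last online CPU.
 */
static bool blk_mq_hctx_last_online_cpu(struct blk_mq_hw_ctx *hctx,
					unsigned int cpu)
{
	unsigned int other;

	if (!cpumask_test_cpu(cpu, hctx->cpumask))
		return false;

	for_each_cpu_and(other, hctx->cpumask, cpu_online_mask) {
		if (other != cpu)
			return false;
	}

	return true;
}

When this holds, no other online CPU can run the hctx, so once
BLK_MQ_S_INTERNAL_STOPPED is set a dispatch racing with the drain
(reason #2 above) should not be possible.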