Re: [PATCH] blk-mq: don't fail driver tag allocation because of inactive hctx

That's your patch - ok, I can try.


I still get timeouts and sometimes the same driver tag message occurs:

[ 1014.232417] run queue from wrong CPU 0, hctx active
[ 1014.237692] run queue from wrong CPU 0, hctx active
[ 1014.243014] run queue from wrong CPU 0, hctx active
[ 1014.248370] run queue from wrong CPU 0, hctx active
[ 1014.253725] run queue from wrong CPU 0, hctx active
[ 1014.259252] run queue from wrong CPU 0, hctx active
[ 1014.264492] run queue from wrong CPU 0, hctx active
[ 1014.269453] irq_shutdown irq146
[ 1014.272752] CPU55: shutdown
[ 1014.275552] psci: CPU55 killed (polled 0 ms)
[ 1015.151530] CPU56: shutdownr=1621MiB/s,w=0KiB/s][r=415k,w=0 IOPS][eta 00m:00s]
[ 1015.154322] psci: CPU56 killed (polled 0 ms)
[ 1015.184345] CPU57: shutdown
[ 1015.187143] psci: CPU57 killed (polled 0 ms)
[ 1015.223388] CPU58: shutdown
[ 1015.226174] psci: CPU58 killed (polled 0 ms)
long sleep 8
[ 1045.234781] scsi_times_out req=0xffff041fa13e6300[r=0,w=0 IOPS][eta 04m:30s]

[...]


>> I thought that if all the sched tags are put, then we should have no
>> driver tag for that same hctx, right? That seems to coincide with the
>> timeout (30 seconds later).

> That is weird: if a driver tag is found, that means the request is
> in-flight and has not been completed by the HW.

In blk_mq_hctx_has_requests(), we iterate the sched tags (when
hctx->sched_tags is set). So can some requests not have a sched tag,
even with a scheduler set for the queue?
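For reference, this is roughly how I read blk_mq_hctx_has_requests() in
the series - paraphrased from memory, so the helper and iterator names
may not match your latest version exactly:

struct rq_iter_data {
	struct blk_mq_hw_ctx *hctx;
	bool has_rq;
};

static bool blk_mq_has_request(struct request *rq, void *data, bool reserved)
{
	struct rq_iter_data *iter_data = data;

	/* Only count requests mapped to this hctx */
	if (rq->mq_hctx != iter_data->hctx)
		return true;
	iter_data->has_rq = true;
	return false;	/* found one, stop iterating */
}

static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
{
	/* With an I/O scheduler attached, requests live in sched_tags */
	struct blk_mq_tags *tags = hctx->sched_tags ?
			hctx->sched_tags : hctx->tags;
	struct rq_iter_data data = {
		.hctx	= hctx,
	};

	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
	return data.has_rq;
}

If every request allocated with a scheduler attached holds a sched tag
for its whole lifetime, this walk should have seen it, hence my question
above.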

> I assume you have integrated the global host tags patch in your test,

No, but the LLDD does not use request->tag - it generates its own.
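To show what I mean - this is just a made-up sketch of an LLDD that
manages its own per-command tags, not the actual hisi_sas code, and all
the my_lldd_* names are invented:

/* Hypothetical driver-private tag (HW slot) allocation */
struct my_lldd_hba {
	spinlock_t lock;
	unsigned long *slot_bitmap;	/* one bit per HW slot */
	unsigned int max_slots;
};

static int my_lldd_alloc_slot(struct my_lldd_hba *hba)
{
	unsigned long flags;
	unsigned int slot;

	spin_lock_irqsave(&hba->lock, flags);
	slot = find_first_zero_bit(hba->slot_bitmap, hba->max_slots);
	if (slot >= hba->max_slots) {
		spin_unlock_irqrestore(&hba->lock, flags);
		return -1;	/* no free HW slot */
	}
	set_bit(slot, hba->slot_bitmap);
	spin_unlock_irqrestore(&hba->lock, flags);

	return slot;
}

static void my_lldd_free_slot(struct my_lldd_hba *hba, unsigned int slot)
{
	clear_bit(slot, hba->slot_bitmap);
}

So the tag the HW sees comes from the driver's own bitmap in its
queuecommand path, and request->tag is never consulted.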

> and suggest you double-check hisi_sas's queue mapping, which has to
> be exactly the same as blk-mq's mapping.


scheduler=none is ok, so I am skeptical of a problem there.
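For completeness, this is the kind of map_queues hook I would expect to
keep an LLDD's queue mapping identical to blk-mq's when the completion
queues use managed MSI vectors - again only a sketch with invented names
(my_lldd_*, MY_LLDD_BASE_VECTORS), not the actual hisi_sas code:

static int my_lldd_map_queues(struct Scsi_Host *shost)
{
	struct my_lldd_hba *hba = shost_priv(shost);
	struct blk_mq_queue_map *qmap =
			&shost->tag_set.map[HCTX_TYPE_DEFAULT];

	/*
	 * Derive the CPU <-> hw queue mapping from the managed irq
	 * affinity, skipping the non-queue vectors at the start.
	 */
	return blk_mq_pci_map_queues(qmap, hba->pci_dev, MY_LLDD_BASE_VECTORS);
}

With something like that in place, blk-mq's idea of which CPUs feed
which hctx and the HW completion queue's irq affinity should not be
able to drift apart.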



> If yes, can you collect the debugfs log after the timeout is triggered?

Same limitation as before - once a SCSI timeout happens, SCSI error
handling kicks in and the shost no longer accepts commands; since that
same shost provides the rootfs, the system becomes unresponsive. But I
can try.

> Just wondering why you don't install two disks in your test machine :-)

The shost becomes unresponsive for all disks. So I could try nfs, but I'm not a fan :)

Cheers


