On Mon, Oct 28, 2019 at 11:55:42AM +0000, John Garry wrote:
> > >
> > > For the SCSI commands which timeout, I notice that
> > > scsi_set_blocked(reason=SCSI_MLQUEUE_EH_RETRY) was called 30 seconds
> > > earlier.
> > >
> > > scsi_set_blocked+0x20/0xb8
> > > __scsi_queue_insert+0x40/0x90
> > > scsi_softirq_done+0x164/0x1c8
> > > __blk_mq_complete_request_remote+0x18/0x20
> > > flush_smp_call_function_queue+0xa8/0x150
> > > generic_smp_call_function_single_interrupt+0x10/0x18
> > > handle_IPI+0xec/0x1a8
> > > arch_cpu_idle+0x10/0x18
> > > do_idle+0x1d0/0x2b0
> > > cpu_startup_entry+0x24/0x40
> > > secondary_start_kernel+0x1b4/0x208
> >
> > Could you investigate a bit the reason why timeout is triggered?
>
> Yeah, it does seem a strange coincidence that the SCSI command even failed
> and we have to retry, since these should be uncommon events. I'll check on
> this LLDD error.
>
> >
> > Especially we suppose to drain all in-flight requests before the
> > last CPU of this hctx becomes offline, and it shouldn't be caused by
> > the hctx becoming dead, so still need you to confirm that all
> > in-flight requests are really drained in your test.
>
> ok
>
> > Or is it still
> > possible to dispatch to LDD after BLK_MQ_S_INTERNAL_STOPPED is set?
>
> It shouldn't be. However it would seem that this IO had been dispatched to
> the LLDD, the hctx dies, and then we attempt to requeue on that hctx.

But this patch does wait for completion of in-flight requests before
shutting down the last CPU of this hctx.

Thanks,
Ming