RE: Device removal lockup with mptsas + scsi-mq

"Elliott, Robert (Server Storage)" <Elliott@xxxxxx> · Wed, 4 Feb 2015 19:29:12 +0000

> -----Original Message-----
> From: linux-scsi-owner@xxxxxxxxxxxxxxx [mailto:linux-scsi-
> owner@xxxxxxxxxxxxxxx] On Behalf Of Tony Battersby
> Sent: Wednesday, 04 February, 2015 12:39 PM
> To: linux-scsi; Jens Axboe; Christoph Hellwig
> Cc: Sreekanth Reddy
> Subject: Device removal lockup with mptsas + scsi-mq
> 
> Summary:
> 
> When removing a SCSI device with scsi-mq, blk_mq_update_tag_set_depth()
> ends up waiting for commands to *other* SCSI devices to complete.  If
> those other SCSI devices are in the SDEV_BLOCK state, then the removal
> deadlocks.
> 
...
> 
> So far the only way I can get device removal to be reliable with scsi-mq
> enabled is by disabling the call to scsi_device_set_state(sdev,
> SDEV_BLOCK) entirely.  Device removal completes successfully with
> scsi-mq disabled, both with an unmodified kernel and with the patch from
> 2012.
> 
...
> Regarding mptsas:
> 
> When the cable is pulled, mptsas calls scsi_device_set_state(sdev,
> SDEV_BLOCK) and sets vtarget->deleted = 1.  If mptsas queuecommand()
> sees vtarget->deleted, it fails the I/O with DID_NO_CONNECT.  There is
> nowhere in mptsas where it calls scsi_device_set_state(sdev,
> SDEV_RUNNING) or scsi_internal_device_unblock() (except in the patch
> from 2012 just before deleting the device).  So setting SDEV_BLOCK is
> just blocking commands that can never do anything but fail anyway, so it
> can probably either be removed, or else a call to
> scsi_internal_device_unblock() should be added somewhere to unblock a
> device that came back.
> 

I ran into issues with mpt3sas usage of SDEV_BLOCK last year, and
recommend dropping that as part of any solution.

Old description:
"After a drive SAS link goes down, I often see device_blocked get
set to 3 and stay there forever, even if the drive comes back.

Although it seems good to keep the CPUs from retrying over and
over again, it's bad that the processes hang and become
unkillable, and really bad that the system cannot shut down.

Everything seems to work better if you return host_byte
set to DID_SOFT_ERROR, which causes the SCSI midlayer to retry
a few times, or DID_IMM_RETRY, which causes infinite retries,
or DID_ERROR with CHECK CONDITION status and an additional sense
code explaining the error.

If the drive is gone too long, you want the application to 
give up and quit.  On the other hand, retrying while giving 
it time to come back is also important.  In SAS, the I_T
nexus loss time should be the basis for calculating how
long to wait."
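
A minimal sketch of that policy, written as standalone C rather than
driver code. The struct, its fields, and pick_result() are hypothetical
names for illustration and do not come from mptsas or mpt3sas. The
DID_* values are copied locally so it compiles outside the kernel.
Returning DID_NO_CONNECT once the nexus loss time has expired is an
assumption, borrowed from what mptsas already does for deleted targets.

```c
#include <stdint.h>

/* Host-byte codes, copied here so the sketch builds in user space.
 * They are meant to mirror the kernel's DID_* definitions. */
enum {
    DID_NO_CONNECT = 0x01,
    DID_SOFT_ERROR = 0x0b,
};

/* Hypothetical per-target state; not an existing driver structure. */
struct target_state {
    int      link_down;           /* nonzero while the SAS link is down */
    uint64_t link_down_since_ms;  /* timestamp when the link dropped */
    uint64_t nexus_loss_ms;       /* I_T nexus loss time for this target */
};

/*
 * Pick a command result instead of leaving the device in SDEV_BLOCK.
 * Returns 0 when the command should be dispatched normally, otherwise
 * the host byte shifted into bits 16-23 of the result word.
 */
static uint32_t pick_result(const struct target_state *t, uint64_t now_ms)
{
    if (!t->link_down)
        return 0;

    /* Still inside the nexus loss window: let the midlayer retry. */
    if (now_ms - t->link_down_since_ms < t->nexus_loss_ms)
        return (uint32_t)DID_SOFT_ERROR << 16;

    /* Gone too long: fail the I/O so the application can give up. */
    return (uint32_t)DID_NO_CONNECT << 16;
}
```

With a target that lost its link at t=1000 ms and a 2000 ms nexus loss
time, a command at t=1500 ms gets DID_SOFT_ERROR and a command at
t=3000 ms gets DID_NO_CONNECT.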

---
Rob Elliott    HP Server Storage
