Hi all,
I have query regarding usage of host_byte DID_REQUEUE vs DID_RESET
returned by LLD to SCSI mid layer.
Let me give some background here.
I am using megaraid_sas controller. megaraid_sas driver returns all
outstanding SCSI commands back to SCSI layer with DID_RESET host_byte
before resetting controller.
The intent is- all these commands returned with DID_RESET before
controller reset should be retried by SCSI.
In few distros, I have observed that if SYNCHRONIZE_CACHE command(should
be applicable for all non Read write commands) is outstanding before
resetting controller and driver returns those commands back with
DID_RESET then SYNCHRONIZE_CACHE command not retried but failed to upper
layer
but other READ/WRITE commands were not failed but retried. I was running
filesystem IOs and SYNHRONIZE_CACHE returned with error cause filesystem
to go in READ only mode.
Later I checked and realized below commit was missing in that distro
kernel and seems to cause the problem-
a621bac scsi_lib: correctly retry failed zero length REQ_TYPE_FS commands
But distro kernel has below commit-
89fb4cd scsi: handle flush errors properly
Issue does not hit on the kernels which don't have both commits but hits
when commit- "89fb4cd scsi: handle flush errors properly " is there
but commit- "a621bac scsi_lib: correctly retry failed zero length
REQ_TYPE_FS commands" is missing.
This issue is observed with mpt3sas driver as well and should be
applicable to all LLDs returning non read write commands with DID_RESET.
Returning DID_REQUEUE instead of DID_RESET from driver solves the problem
irrespective of whether these above mentioned commits are there or not in
kernel.
So I am thinking to use DID_REQUEUE instead of DID_RESET in megaraid_sas
driver for all SCSI commands(not only limited to SYNCHRONIZE_CACHE or non
read write commands)
before resetting controller. As I mentioned intent is those outstanding
commands returned by driver before doing controller reset should be
retried and as soon as reset is complete,
driver will be processing those commands. There is no sense key associated
with SCSI commands returned.
I browsed SCSI code and get to know DID_REQUEUE causes command to be
retried by calling- scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY).
This seems to be good enough for our requirement of commands to be
re-tried by SCSI layer.
Please provide feedback if anyone forsee any issue with using DID_REQUEUE
for this use case.
I will be doing some testing with DID_REQUEUE in place of DID_RESET in
megaraid_sas driver.
I have attached here a proposed patch for megaraid_sas driver.
If there are no concerns, I will send this patch to SCSI mailing list.
Thanks,
Sumit
--- Begin Message ---
- To: <kashyap.desai@xxxxxxxxxxxx>, <sumit.saxena@xxxxxxxxxxxx>
- Subject: [PATCH] megaraid_sas: Use DID_REQUEUE instead of DID_RESET in OCR path
- From: "Sumit Saxena" <sumit.saxena@xxxxxxxxxxxx>
- Date: Tue, 13 Dec 2016 18:34:55 +0530
Driver returns outstanding SCSI commands to SCSI layer with host_byte
set to DID_RESET before resetting controller so that SCSI layer should
retry
those commands.
Sending DID_RESET for non RW commands(e.g SYNCHRONIZE_CACHE) may cause
those commands getting failed and returning I/O erros on few distros which
has included commit- 89fb4cd scsi: handle flush errors properly but missed
commit- a621bac scsi_lib: correctly retry failed zero length REQ_TYPE_FS
commands.
However Read write commands returned with DID_RESET are not failed but
retried.
DID_REQUEUE seems safer to use instead of DID_RESET for all outstanding
commands before doing chip reset as it serves purpose of getting all
commands
to be retried by SCSI layer.
Signed-off-by: Sumit Saxena <sumit.saxena@xxxxxxxxxxxx>
Signed-off-by: Kashyap Desai <kashyap.desai@xxxxxxxxxxxx>
---
drivers/scsi/megaraid/megaraid_sas_base.c | 4 ++--
drivers/scsi/megaraid/megaraid_sas_fusion.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
b/drivers/scsi/megaraid/megaraid_sas_base.c
index 6484c38..959ce3e 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -1662,7 +1662,7 @@ megasas_queue_command(struct Scsi_Host *shost,
struct scsi_cmnd *scmd)
/* Check for an mpio path and adjust behavior */
if (atomic_read(&instance->adprecovery) ==
MEGASAS_ADPRESET_SM_INFAULT) {
if (megasas_check_mpio_paths(instance, scmd) ==
- (DID_RESET << 16)) {
+ (DID_REQUEUE << 16)) {
return SCSI_MLQUEUE_HOST_BUSY;
} else {
scmd->result = DID_NO_CONNECT << 16;
@@ -2483,7 +2483,7 @@ static int megasas_wait_for_outstanding(struct
megasas_instance *instance)
struct megasas_cmd, list);
list_del_init(&reset_cmd->list);
if (reset_cmd->scmd) {
- reset_cmd->scmd->result = DID_RESET << 16;
+ reset_cmd->scmd->result = DID_REQUEUE <<
16;
dev_notice(&instance->pdev->dev, "%d:%p
reset [%02x]\n",
reset_index, reset_cmd,
reset_cmd->scmd->cmnd[0]);
diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c
b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index 24778ba..58cbe30 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -3362,7 +3362,7 @@ int megasas_check_mpio_paths(struct megasas_instance
*instance,
struct scsi_cmnd *scmd)
{
struct megasas_instance *peer_instance = NULL;
- int retval = (DID_RESET << 16);
+ int retval = (DID_REQUEUE << 16);
if (instance->peerIsPresent) {
peer_instance = megasas_get_peer_instance(instance);
--
2.4.11
--- End Message ---