RE: [PATCH 2/5] fusion: vmware bug fix prevent infinite retries

"Moore, Eric" <Eric.Moore@xxxxxxx> · Mon, 8 Jan 2007 15:03:15 -0700

On Saturday, January 06, 2007 8:31 AM, James Bottomley wrote:

> 
> DID_BUS_BUSY causes an immediate retry, but it does debit the retry
> count, so it shouldn't cause "infinite retries" ... if it does, there's
> something else wrong here.
> 
> I should also point out that the MPI_SCSI_STATUS_BUSY is
> SAM_STAT_BUSY ... this return will cause a queue stop and a requeue, but
> it doesn't actually debit the retries, so it *may* cause an infinite
> loop if the system is permanently busy.
> 
> Finally, whatever's causing this, it should probably be treated the same
> for all fusion bus types ...
> 

James - I worded this patch incorrectly.  Please read further.

The original request came to me and you from Manon Goo <manon@xxxxxxxx> on
November 21; see attached.

Here is what VMware says, per Adam Zimman <azimman@xxxxxxxxxx>:

"VMkernel emulates a 1030/SPI.  Path Failovers induce the vmkernel to
return a BUSY status for VM initiated SCSI I/O requests.  After a few I/O
commands are returned with BUSY status, the RHEL VM will make the disk
read-only.  The host status of DID_BUS_BUSY causes the RHEL scsi error
recovery process to retry a BUSY I/O at most 5 times and then return an
I/O failure upward in the I/O stack.  If the I/O request failed with a
scsi status of BUSY rather than a host status of DID_BUS_BUSY, the RHEL
scsi error recovery process would retry the I/O indefinitely."

In driver version 03.02.19, we added the current logic for the following
reason:

"When a target device responds with BUSY status, the MPT driver was
sending DID_OK to the SCSI mid layer, which caused the IO to be retried
indefinitely between the mid layer and the driver.  By changing the driver
return status to DID_BUS_BUSY, the target BUSY status can now flow through
the mid layer to an upper layer Failover driver, which will manage the I/O
timeout."
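
To make the difference between the two return paths concrete, here is a
minimal standalone sketch of how the result word is composed in each case.
It is not the driver code: make_result(), map_busy_result(), host_byte_of()
and status_byte_of() are hypothetical helpers for illustration only, and
the constants are restated locally with the values I believe the SCSI
headers use (DID_OK 0x00, DID_BUS_BUSY 0x02, SAM_STAT_BUSY 0x08).

```c
#include <stdint.h>

/* Restated locally for illustration; assumed to match the SCSI headers. */
#define DID_OK         0x00  /* host byte: no host-level error */
#define DID_BUS_BUSY   0x02  /* host byte: bus stayed busy */
#define SAM_STAT_BUSY  0x08  /* SCSI status byte: target reported BUSY */

/* Pack a host byte and a SCSI status byte the way sc->result is laid out. */
static uint32_t make_result(uint32_t host_byte, uint32_t scsi_status)
{
    return (host_byte << 16) | scsi_status;
}

/*
 * Hypothetical helper mirroring the 03.02.19 logic: a target BUSY status
 * is reported with a DID_BUS_BUSY host byte, anything else with DID_OK.
 */
static uint32_t map_busy_result(uint32_t scsi_status)
{
    if (scsi_status == SAM_STAT_BUSY)
        return make_result(DID_BUS_BUSY, scsi_status);
    return make_result(DID_OK, scsi_status);
}

/* Extract the host byte (bits 16-23) from a result word. */
static uint32_t host_byte_of(uint32_t result)
{
    return (result >> 16) & 0xff;
}

/* Extract the SCSI status byte (bits 0-7) from a result word. */
static uint32_t status_byte_of(uint32_t result)
{
    return result & 0xff;
}
```

With the older behaviour the BUSY case would have produced
make_result(DID_OK, SAM_STAT_BUSY), so the mid layer saw only the SCSI
status; with the current logic it also sees the DID_BUS_BUSY host byte,
which is what makes the retries count against the limit.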

Eric 

--- Begin Message ---

To: "Moore, Eric" <Eric.Moore@xxxxxxx>,	<James.Bottomley@xxxxxxxxxxxx>
Subject: concerning mptscsih.c
From: "Manon Goo" <manon@xxxxxxxx>
Date: Tue, 21 Nov 2006 18:39:26 -0700
Cc: "Andreas Dembach" <ad@xxxxxxxx>,	"David Berghoff" <david@xxxxxxxx>
Reply-to: "Manon Goo" <manon@xxxxxxxx>
Thread-index: AccN1xAfVOJ7JmFES22NGAYhC50wXg==
Thread-topic: concerning mptscsih.c

Dear Sirs,

When changing from kernel 2.6.13 to 2.6.14, a change to the mptscsih.c 
driver was introduced that changed the behaviour of the driver with respect 
to timeouts.

The changes were introduced around line 760.  As far as I understand, this 
change propagates a busy device running async as a host failure.
This is extremely troublesome when using the mptscsi driver with VMware ESX, 
because ESX expects the driver to wait when doing SAN path failovers or 
going async.
Is there any chance of getting the old behaviour back?

Thanks in advance
Manon Goo

                       break;

+               case MPI_IOCSTATUS_SCSI_DATA_OVERRUN:           /* 0x0044 */
+                       sc->resid=0;
               case MPI_IOCSTATUS_SCSI_RECOVERED_ERROR:        /* 0x0040 */
               case MPI_IOCSTATUS_SUCCESS:                     /* 0x0000 */
-                       scsi_status = pScsiReply->SCSIStatus;
-                       sc->result = (DID_OK << 16) | scsi_status;
+                       if (scsi_status == MPI_SCSI_STATUS_BUSY)
+                               sc->result = (DID_BUS_BUSY << 16) | scsi_status;
+                       else
+                               sc->result = (DID_OK << 16) | scsi_status;
                       if (scsi_state == 0) {
                               ;
                       } else if (scsi_state & MPI_SCSI_STATE_AUTOSENSE_VALID) {

Manon Goo
Dembach Goo Informatik GmbH & Co KG
Rathenauplatz 9
D-50674 Köln
Tel: +49 221 801483 0
Mobil: +49 177 8091974
Fax: +49 221 801483 20
Email: manon@xxxxxxxx

--- End Message ---