Re: [PATCH 04/05] mptfusion: Fix for device offline while doing aggressive HBA reset

Bernd Schubert <bernd.schubert@xxxxxxxxxxxxxxxxxx> · Thu, 04 Aug 2011 15:47:40 +0200

On 08/04/2011 03:37 PM, Desai, Kashyap wrote:


-----Original Message-----
From: Bernd Schubert [mailto:bernd.schubert@xxxxxxxxxxxxxxxxxx]
Sent: Thursday, August 04, 2011 6:39 PM
To: Desai, Kashyap
Cc: linux-scsi@xxxxxxxxxxxxxxx; Nandigama, Nagalakshmi; Prakash, Sathya;
Moore, Eric; JBottomley@xxxxxxxxxxxxx
Subject: Re: [PATCH 04/05] mptfusion: Fix for device offline while doing
aggressive HBA reset

On 08/04/2011 01:13 PM, kashyap.desai@xxxxxxx wrote:
Issue:
A device goes offline when an aggressive HBA reset is run
alongside I/O using some utility.

Root cause:
The FW goes into a bad state due to the aggressive reset, and a
soft reset does not help the FW recover. The aggressive reset also
opens a window for the error handling thread to be kicked off while
the HBA is in a constant RESET loop as part of the aggressive reset
test case, which can lead the device to go offline.

Changes:
1. Added an extra check inside the eh_timed_out callback, as below:
if (ioc->ioc_reset_in_progress)
      Rc = EH_TIMER_RESET
2. Removed the "DOORBELL_ACTIVE" check for SAS controllers from the task
management context.
     Since SAS controllers use a high priority queue for task management,
     this check is not required for them.
3. Replaced the SoftReset call with HardReset in the Task Mgmt context.
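
Change 1 above can be sketched as a small standalone C fragment. This is only an illustration under stated assumptions: `mock_ioc_eh`, `mock_eh_timed_out`, and the `MOCK_EH_*` values are hypothetical stand-ins, not the kernel's or the driver's definitions, and the corresponding hunk is not shown in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes; the real callback uses the kernel's own enum. */
enum mock_eh_return {
	MOCK_EH_NOT_HANDLED,	/* let normal error handling proceed */
	MOCK_EH_TIMER_RESET	/* re-arm the command timer and wait */
};

/* Hypothetical, reduced adapter state; only the flag named in the patch. */
struct mock_ioc_eh {
	bool ioc_reset_in_progress;
};

/*
 * While a reset is in progress, ask for the timer to be restarted instead
 * of handing the command to the error handler, so the error handling
 * thread is not kicked off in the middle of a reset loop.
 */
static enum mock_eh_return mock_eh_timed_out(const struct mock_ioc_eh *ioc)
{
	assert(ioc != NULL);

	if (ioc->ioc_reset_in_progress)
		return MOCK_EH_TIMER_RESET;

	return MOCK_EH_NOT_HANDLED;
}
```

The point of the check is ordering: a command that times out during a reset is given more time rather than escalated.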

[...]

--- a/drivers/message/fusion/mptscsih.c
+++ b/drivers/message/fusion/mptscsih.c
@@ -1630,7 +1630,13 @@ mptscsih_IssueTaskMgmt(MPT_SCSI_HOST *hd, u8 type, u8 channel, u8 id, int lun,
   		return 0;
   	}

-	if (ioc_raw_state & MPI_DOORBELL_ACTIVE) {
+	/* DOORBELL ACTIVE check is not required if
+	*  MPI_IOCFACTS_CAPABILITY_HIGH_PRI_Q is supported.
+	*/
+
+	if (!((ioc->facts.IOCCapabilities & MPI_IOCFACTS_CAPABILITY_HIGH_PRI_Q)
+		&& (ioc->facts.MsgVersion >= MPI_VERSION_01_05)) &&
+		(ioc_raw_state & MPI_DOORBELL_ACTIVE)) {
   		printk(MYIOC_s_WARN_FMT
   			"TaskMgmt type=%x: ioc_state: "
   			"DOORBELL_ACTIVE (0x%x)!\n",
@@ -1729,7 +1735,7 @@ mptscsih_IssueTaskMgmt(MPT_SCSI_HOST *hd, u8 type, u8 channel, u8 id, int lun,
   		printk(MYIOC_s_WARN_FMT
   		       "Issuing Reset from %s!! doorbell=0x%08x\n",
   		       ioc->name, __func__, mpt_GetIocState(ioc, 0));
-		retval = mpt_Soft_Hard_ResetHandler(ioc, CAN_SLEEP);
+		retval = mpt_HardResetHandler(ioc, CAN_SLEEP);
   		mpt_free_msg_frame(ioc, mf);
   	}
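
The new DOORBELL_ACTIVE condition in the first hunk is easier to read when pulled out into a predicate. The following is a minimal sketch, not driver code: the `MOCK_*` constants are placeholder values and `mock_facts` / `mock_tm_blocked_by_doorbell` are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder values for illustration; see the driver headers for the real ones. */
#define MOCK_DOORBELL_ACTIVE	0x08000000u
#define MOCK_CAP_HIGH_PRI_Q	0x00000001u
#define MOCK_VERSION_01_05	0x0105u

/* Hypothetical subset of the IOC facts consulted by the check. */
struct mock_facts {
	uint32_t capabilities;
	uint16_t msg_version;
};

/*
 * Returns true when task management must be refused because the doorbell
 * is active. Controllers that advertise a high priority queue (and a new
 * enough message version) skip the doorbell check entirely.
 */
static bool mock_tm_blocked_by_doorbell(const struct mock_facts *facts,
					uint32_t ioc_raw_state)
{
	bool has_high_pri_q;

	assert(facts != NULL);

	has_high_pri_q = (facts->capabilities & MOCK_CAP_HIGH_PRI_Q) &&
			 facts->msg_version >= MOCK_VERSION_01_05;

	return !has_high_pri_q && (ioc_raw_state & MOCK_DOORBELL_ACTIVE);
}
```

Written this way, the condition reads as "no high priority queue, and doorbell active", which matches the comment added in the hunk.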

Have you ever tested that with dual port 501030C parallel SCSI HBAs? The
hard reset with those HBAs will reset *both* ports and eventually *both*
ports will fail. A couple of years ago I tried to convince Eric to
disable hard resets for those chips altogether (and even sent a patch),
but Eric never agreed to that.
The soft-reset handler was a workaround for that problem, but with this
patch the issue will reappear. The affected systems are still in
production and probably will be for the next few years.

I have not tried this with a dual port 501030C parallel SCSI HBA, but I remember the exact issue you describe here.
I can add a check for ioc->bus_type == SAS to use HardReset; in all other cases I will continue with SoftReset.
I just want to know whether this is fine for avoiding the issue you mentioned.

Please let me know your view on it, so that I can resend the patch.

Yes, I think adding a test for SAS would be fine and would keep the
workaround for 1030C chips.

Thanks,
Bernd
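
For reference, the compromise agreed in this thread (hard reset only for SAS, existing soft-reset path for everything else) could look roughly like the sketch below. It is an assumption-laden illustration, not the resent patch: `mock_adapter`, `mock_pick_reset`, and the `MOCK_*` enumerators are hypothetical names; only the `bus_type == SAS` test comes from the discussion above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bus types, standing in for the driver's bus_type field. */
enum mock_bus_type { MOCK_FC, MOCK_SPI, MOCK_SAS };

/* Which reset handler the task management path would call. */
enum mock_reset_kind {
	MOCK_SOFT_HARD_RESET,	/* stands in for mpt_Soft_Hard_ResetHandler() */
	MOCK_HARD_RESET		/* stands in for mpt_HardResetHandler() */
};

/* Hypothetical, reduced adapter descriptor. */
struct mock_adapter {
	enum mock_bus_type bus_type;
};

/*
 * SAS controllers go straight to the hard reset; parallel SCSI (and any
 * other bus type) keeps the existing soft-reset path, so a dual port
 * parallel SCSI HBA is not hard reset on both ports from this context.
 */
static enum mock_reset_kind mock_pick_reset(const struct mock_adapter *ioc)
{
	assert(ioc != NULL);

	if (ioc->bus_type == MOCK_SAS)
		return MOCK_HARD_RESET;

	return MOCK_SOFT_HARD_RESET;
}
```

Keeping the decision in one helper makes the bus-type exception visible at the call site instead of burying it in the reset handler.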
