Patch "bus: mhi: host: Add MHI_PM_SYS_ERR_FAIL state" has been added to the 6.1-stable tree

Sasha Levin <sashal@xxxxxxxxxx> · Wed, 10 Apr 2024 12:09:04 -0400

This is a note to let you know that I've just added the patch titled

    bus: mhi: host: Add MHI_PM_SYS_ERR_FAIL state

to the 6.1-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     bus-mhi-host-add-mhi_pm_sys_err_fail-state.patch
and it can be found in the queue-6.1 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.



commit cef1a7dbc7ddc135730424972b950319c81dd1d4
Author: Jeffrey Hugo <quic_jhugo@xxxxxxxxxxx>
Date:   Fri Jan 12 11:08:00 2024 -0700

    bus: mhi: host: Add MHI_PM_SYS_ERR_FAIL state
    
    [ Upstream commit bce3f770684cc1d91ff9edab431b71ac991faf29 ]
    
    When processing a SYSERR, if the device does not respond to the MHI_RESET
    from the host, the host will be stuck in a difficult-to-recover state.
    The host will remain in MHI_PM_SYS_ERR_PROCESS and not clean up the host
    channels.  Clients will not be notified of the SYSERR via the destruction
    of their channel devices, which means clients may think that the device is
    still up.  Subsequent SYSERR events such as a device fatal error will not
    be processed as the state machine cannot transition from PROCESS back to
    DETECT.  The only way to recover from this is to unload the mhi module
    (wipe the state machine state) or for the mhi controller to initiate
    SHUTDOWN.
    
    This issue was discovered by stress testing soc_reset events on AIC100
    via the sysfs node.
    
    soc_reset is processed entirely in hardware.  When the register write
    hits the endpoint hardware, it causes the soc to reset without firmware
    involvement.  In stress testing, there is a rare race where soc_reset N
    will cause the soc to reset and PBL to signal SYSERR (fatal error).  If
    soc_reset N+1 is triggered before PBL can process the MHI_RESET from the
    host, then the soc will reset again, and re-run PBL from the beginning.
    This will cause PBL to lose all state.  PBL will be waiting for the host
    to respond to the new SYSERR, but the host will be stuck expecting the
    previous MHI_RESET to be processed.
    
    Additionally, the AMSS EE firmware (QSM) was hacked to synthetically
    reproduce the issue by simulating a FW hang after the QSM issued a
    SYSERR.  In this case, soc_reset would not recover the device.
    
    For this failure case, to recover the device, we need a state similar to
    PROCESS that can also transition to DETECT.  There is no viable existing
    state to use.  POR has the needed transitions, but assumes the device is
    in a good state and could allow the host to attempt to use the device.
    Allowing PROCESS to transition to DETECT invites the possibility of
    parallel SYSERR processing which could get the host and device out of
    sync.
    
    Thus, invent a new state: MHI_PM_SYS_ERR_FAIL.
    
    This is essentially a holding state.  It allows us to clean up the host
    elements that are based on the old state of the device (channels), but
    does not allow us to directly advance back to an operational state.  It
    does allow the detection and processing of another SYSERR which may
    recover the device, or allows the controller to do a clean shutdown.
    
    Signed-off-by: Jeffrey Hugo <quic_jhugo@xxxxxxxxxxx>
    Reviewed-by: Carl Vanderlip <quic_carlv@xxxxxxxxxxx>
    Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@xxxxxxxxxx>
    Link: https://lore.kernel.org/r/20240112180800.536733-1-quic_jhugo@xxxxxxxxxxx
    Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@xxxxxxxxxx>
    Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>
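
Editorial illustration (not part of the upstream commit): the transition rules
described in the commit message can be modelled in user space roughly as
follows. This is a minimal sketch, not the driver code. The PM_* names,
struct pm_transition, and pm_can_transition() are hypothetical stand-ins for
the kernel's MHI_PM_* macros and the lookup done by mhi_tryset_pm_state().
The value of PM_POR is assumed, since the diff does not show it, and the
SYS_ERR_DETECT row is reduced to the single edge listed in the state-diagram
comment in pm.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BIT(n) (1u << (n))

/* Error-path state bits as they stand after this patch. */
#define PM_POR                  BIT(1)   /* assumed value, not shown in the diff */
#define PM_FW_DL_ERR            BIT(7)
#define PM_SYS_ERR_DETECT       BIT(8)
#define PM_SYS_ERR_PROCESS      BIT(9)
#define PM_SYS_ERR_FAIL         BIT(10)  /* new state */
#define PM_SHUTDOWN_PROCESS     BIT(11)  /* was BIT(10) */
#define PM_LD_ERR_FATAL_DETECT  BIT(12)  /* was BIT(11) */

/* Same comparison as MHI_PM_IN_ERROR_STATE() in internal.h. */
#define PM_IN_ERROR_STATE(s)    ((s) >= PM_FW_DL_ERR)

/* The new bit sits above FW_DL_ERR, so it counts as an error state. */
static_assert(PM_IN_ERROR_STATE(PM_SYS_ERR_FAIL),
              "SYS_ERR_FAIL must be classified as an error state");

/* One row per source state; 'to' is a mask of the allowed next states. */
struct pm_transition {
    unsigned int from;
    unsigned int to;
};

static const struct pm_transition transitions[] = {
    /* Simplified: only the edge named in the state-diagram comment. */
    { PM_SYS_ERR_DETECT,  PM_SYS_ERR_PROCESS },
    { PM_SYS_ERR_PROCESS, PM_POR | PM_SYS_ERR_FAIL | PM_SHUTDOWN_PROCESS |
                          PM_LD_ERR_FATAL_DETECT },
    { PM_SYS_ERR_FAIL,    PM_SYS_ERR_DETECT | PM_SHUTDOWN_PROCESS |
                          PM_LD_ERR_FATAL_DETECT },
};

/* Return true if the table permits moving from state 'from' to state 'to'. */
bool pm_can_transition(unsigned int from, unsigned int to)
{
    for (size_t i = 0; i < sizeof(transitions) / sizeof(transitions[0]); i++) {
        if (transitions[i].from == from)
            return (transitions[i].to & to) != 0;
    }
    return false; /* unknown source state: reject */
}
```

In this model, pm_can_transition(PM_SYS_ERR_FAIL, PM_SYS_ERR_DETECT) is true
while pm_can_transition(PM_SYS_ERR_FAIL, PM_POR) is false, which matches the
"holding state" behaviour the commit message describes: a new SYSERR or a
shutdown can leave SYS_ERR_FAIL, but there is no direct path back to an
operational state.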

diff --git a/drivers/bus/mhi/host/init.c b/drivers/bus/mhi/host/init.c
index 04fbccff65ac2..60c1df048fa20 100644
--- a/drivers/bus/mhi/host/init.c
+++ b/drivers/bus/mhi/host/init.c
@@ -62,6 +62,7 @@ static const char * const mhi_pm_state_str[] = {
 	[MHI_PM_STATE_FW_DL_ERR] = "Firmware Download Error",
 	[MHI_PM_STATE_SYS_ERR_DETECT] = "SYS ERROR Detect",
 	[MHI_PM_STATE_SYS_ERR_PROCESS] = "SYS ERROR Process",
+	[MHI_PM_STATE_SYS_ERR_FAIL] = "SYS ERROR Failure",
 	[MHI_PM_STATE_SHUTDOWN_PROCESS] = "SHUTDOWN Process",
 	[MHI_PM_STATE_LD_ERR_FATAL_DETECT] = "Linkdown or Error Fatal Detect",
 };
diff --git a/drivers/bus/mhi/host/internal.h b/drivers/bus/mhi/host/internal.h
index 01fd10a399b61..6abf09da4f618 100644
--- a/drivers/bus/mhi/host/internal.h
+++ b/drivers/bus/mhi/host/internal.h
@@ -88,6 +88,7 @@ enum mhi_pm_state {
 	MHI_PM_STATE_FW_DL_ERR,
 	MHI_PM_STATE_SYS_ERR_DETECT,
 	MHI_PM_STATE_SYS_ERR_PROCESS,
+	MHI_PM_STATE_SYS_ERR_FAIL,
 	MHI_PM_STATE_SHUTDOWN_PROCESS,
 	MHI_PM_STATE_LD_ERR_FATAL_DETECT,
 	MHI_PM_STATE_MAX
@@ -104,14 +105,16 @@ enum mhi_pm_state {
 #define MHI_PM_FW_DL_ERR				BIT(7)
 #define MHI_PM_SYS_ERR_DETECT				BIT(8)
 #define MHI_PM_SYS_ERR_PROCESS				BIT(9)
-#define MHI_PM_SHUTDOWN_PROCESS				BIT(10)
+#define MHI_PM_SYS_ERR_FAIL				BIT(10)
+#define MHI_PM_SHUTDOWN_PROCESS				BIT(11)
 /* link not accessible */
-#define MHI_PM_LD_ERR_FATAL_DETECT			BIT(11)
+#define MHI_PM_LD_ERR_FATAL_DETECT			BIT(12)
 
 #define MHI_REG_ACCESS_VALID(pm_state)			((pm_state & (MHI_PM_POR | MHI_PM_M0 | \
 						MHI_PM_M2 | MHI_PM_M3_ENTER | MHI_PM_M3_EXIT | \
 						MHI_PM_SYS_ERR_DETECT | MHI_PM_SYS_ERR_PROCESS | \
-						MHI_PM_SHUTDOWN_PROCESS | MHI_PM_FW_DL_ERR)))
+						MHI_PM_SYS_ERR_FAIL | MHI_PM_SHUTDOWN_PROCESS |  \
+						MHI_PM_FW_DL_ERR)))
 #define MHI_PM_IN_ERROR_STATE(pm_state)			(pm_state >= MHI_PM_FW_DL_ERR)
 #define MHI_PM_IN_FATAL_STATE(pm_state)			(pm_state == MHI_PM_LD_ERR_FATAL_DETECT)
 #define MHI_DB_ACCESS_VALID(mhi_cntrl)			(mhi_cntrl->pm_state & mhi_cntrl->db_access)
diff --git a/drivers/bus/mhi/host/pm.c b/drivers/bus/mhi/host/pm.c
index 8a4362d75fc43..27f8a40f288cf 100644
--- a/drivers/bus/mhi/host/pm.c
+++ b/drivers/bus/mhi/host/pm.c
@@ -36,7 +36,10 @@
  *     M0 <--> M0
  *     M0 -> FW_DL_ERR
  *     M0 -> M3_ENTER -> M3 -> M3_EXIT --> M0
- * L1: SYS_ERR_DETECT -> SYS_ERR_PROCESS --> POR
+ * L1: SYS_ERR_DETECT -> SYS_ERR_PROCESS
+ *     SYS_ERR_PROCESS -> SYS_ERR_FAIL
+ *     SYS_ERR_FAIL -> SYS_ERR_DETECT
+ *     SYS_ERR_PROCESS --> POR
  * L2: SHUTDOWN_PROCESS -> LD_ERR_FATAL_DETECT
  *     SHUTDOWN_PROCESS -> DISABLE
  * L3: LD_ERR_FATAL_DETECT <--> LD_ERR_FATAL_DETECT
@@ -93,7 +96,12 @@ static const struct mhi_pm_transitions dev_state_transitions[] = {
 	},
 	{
 		MHI_PM_SYS_ERR_PROCESS,
-		MHI_PM_POR | MHI_PM_SHUTDOWN_PROCESS |
+		MHI_PM_POR | MHI_PM_SYS_ERR_FAIL | MHI_PM_SHUTDOWN_PROCESS |
+		MHI_PM_LD_ERR_FATAL_DETECT
+	},
+	{
+		MHI_PM_SYS_ERR_FAIL,
+		MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
 		MHI_PM_LD_ERR_FATAL_DETECT
 	},
 	/* L2 States */
@@ -624,7 +632,13 @@ static void mhi_pm_sys_error_transition(struct mhi_controller *mhi_cntrl)
 					!in_reset, timeout);
 		if (!ret || in_reset) {
 			dev_err(dev, "Device failed to exit MHI Reset state\n");
-			goto exit_sys_error_transition;
+			write_lock_irq(&mhi_cntrl->pm_lock);
+			cur_state = mhi_tryset_pm_state(mhi_cntrl,
+							MHI_PM_SYS_ERR_FAIL);
+			write_unlock_irq(&mhi_cntrl->pm_lock);
+			/* Shutdown may have occurred, otherwise cleanup now */
+			if (cur_state != MHI_PM_SYS_ERR_FAIL)
+				goto exit_sys_error_transition;
 		}
 
 		/*