On Thu, Jan 04, 2024 at 08:53:47AM -0700, Jeffrey Hugo wrote:
> On 1/4/2024 4:26 AM, Manivannan Sadhasivam wrote:
> > On Fri, Dec 08, 2023 at 03:13:53PM -0700, Jeffrey Hugo wrote:
> > > When processing a SYSERR, if the device does not respond to the MHI_RESET
> > > from the host, the host will be stuck in a difficult to recover state.
> > 
> > Under what scenario can this happen? Is the device not honoring the
> > MHI_RESET state, or has it crashed completely?
> 
> Digging up my notes from this patch, it was originally discovered via
> soc_reset stress testing.
> 

Through sysfs I believe...

> On AIC100 (and I assume other MHI devices because the hardware
> implementation is common), soc_reset is processed entirely in hardware.
> When the register write hits the endpoint, it causes the soc to reset
> without firmware involvement.
> 
> If you stress test soc_reset and hit the timing just right, you can have PBL
> signal syserr (fatal error) for soc_reset N, and then before PBL can process
> the MHI_RESET request from the host, soc_reset N+1 hits the endpoint causing
> the soc to reset and re-run PBL from the beginning, which causes PBL to lose
> all state. This is how we discovered this issue, although the reproduction
> rate was rather low.
> 

Thanks for the info. Could you please add the same in the commit message and
send the next version? Otherwise, the patch looks good to me.

> I was able to hack the AMSS EE firmware (QSM) to synthetically reproduce the
> issue as well. Send a trigger to QSM via an unused MHI register to invoke
> syserr (non-fatal error), and then have QSM ignore the MHI_RESET request
> which would simulate some kind of FW hang. soc_reset would not recover the
> device.
> 

Ok. Even though this is not an issue that would commonly impact users, I'm
inclined to have this change, just in case...

- Mani

> > 
> > - Mani
> > 
> > > The host will remain in MHI_PM_SYS_ERR_PROCESS and not clean up the host
> > > channels. Clients will not be notified of the SYSERR via the destruction
> > > of their channel devices, which means clients may think that the device is
> > > still up. Subsequent SYSERR events such as a device fatal error will not
> > > be processed as the state machine cannot transition from PROCESS back to
> > > DETECT. The only way to recover from this is to unload the mhi module
> > > (wipe the state machine state) or for the mhi controller to initiate
> > > SHUTDOWN.
> > > 
> > > In this failure case, to recover the device, we need a state similar to
> > > PROCESS, but one that can transition to DETECT. There is not a viable
> > > existing state to use. POR has the needed transitions, but assumes the
> > > device is in a good state and could allow the host to attempt to use the
> > > device. Allowing PROCESS to transition to DETECT invites the possibility
> > > of parallel SYSERR processing, which could get the host and device out of
> > > sync.
> > > 
> > > Thus, invent a new state - MHI_PM_SYS_ERR_FAIL
> > > 
> > > This is essentially a holding state. It allows us to clean up the host
> > > elements that are based on the old state of the device (channels), but
> > > does not allow us to directly advance back to an operational state. It
> > > does allow the detection and processing of another SYSERR, which may
> > > recover the device, or allows the controller to do a clean shutdown.
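
To make the "holding state" semantics concrete, here is a minimal userspace
model of the transition-table check (illustrative only, not the kernel code:
the PM_* constants, the two-entry table and tryset_pm_state() below stand in
for the real dev_state_transitions[] and mhi_tryset_pm_state()):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define BIT(n)			(1U << (n))
#define PM_POR			BIT(1)
#define PM_SYS_ERR_DETECT	BIT(8)
#define PM_SYS_ERR_PROCESS	BIT(9)
#define PM_SYS_ERR_FAIL		BIT(10)
#define PM_SHUTDOWN_PROCESS	BIT(11)
#define PM_LD_ERR_FATAL_DETECT	BIT(12)

struct pm_transition {
	unsigned int from_state;
	unsigned int to_states;		/* bitmask of allowed targets */
};

static const struct pm_transition transitions[] = {
	{ PM_SYS_ERR_PROCESS, PM_POR | PM_SYS_ERR_FAIL |
			      PM_SHUTDOWN_PROCESS | PM_LD_ERR_FATAL_DETECT },
	/* the new holding state: no direct path back to POR or the M-states */
	{ PM_SYS_ERR_FAIL,    PM_SYS_ERR_DETECT | PM_SHUTDOWN_PROCESS |
			      PM_LD_ERR_FATAL_DETECT },
};

/* Same idea as mhi_tryset_pm_state(): move only if the table allows it */
static bool tryset_pm_state(unsigned int *cur, unsigned int next)
{
	for (size_t i = 0; i < sizeof(transitions) / sizeof(transitions[0]); i++) {
		if (transitions[i].from_state == *cur &&
		    (transitions[i].to_states & next)) {
			*cur = next;
			return true;
		}
	}
	return false;
}

int main(void)
{
	unsigned int state = PM_SYS_ERR_FAIL;

	/* Rejected: FAIL may not jump straight back to an operational state */
	printf("FAIL -> POR: %d\n", tryset_pm_state(&state, PM_POR));
	/* Accepted: a fresh SYSERR detection is a valid way forward */
	printf("FAIL -> SYS_ERR_DETECT: %d\n",
	       tryset_pm_state(&state, PM_SYS_ERR_DETECT));
	return 0;
}

Built as an ordinary C program this prints 0 for the first attempt and 1 for
the second, which is the point of the new state: no shortcut back to an
operational state, while a fresh SYSERR (or a shutdown) can still move things
forward.
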
> > > 
> > > Signed-off-by: Jeffrey Hugo <quic_jhugo@xxxxxxxxxxx>
> > > Reviewed-by: Carl Vanderlip <quic_carlv@xxxxxxxxxxx>
> > > ---
> > >  drivers/bus/mhi/host/init.c     |  1 +
> > >  drivers/bus/mhi/host/internal.h |  9 ++++++---
> > >  drivers/bus/mhi/host/pm.c       | 20 +++++++++++++++++---
> > >  3 files changed, 24 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/drivers/bus/mhi/host/init.c b/drivers/bus/mhi/host/init.c
> > > index e2c2f510b04f..d3f919277cf9 100644
> > > --- a/drivers/bus/mhi/host/init.c
> > > +++ b/drivers/bus/mhi/host/init.c
> > > @@ -62,6 +62,7 @@ static const char * const mhi_pm_state_str[] = {
> > >  	[MHI_PM_STATE_FW_DL_ERR] = "Firmware Download Error",
> > >  	[MHI_PM_STATE_SYS_ERR_DETECT] = "SYS ERROR Detect",
> > >  	[MHI_PM_STATE_SYS_ERR_PROCESS] = "SYS ERROR Process",
> > > +	[MHI_PM_STATE_SYS_ERR_FAIL] = "SYS ERROR Failure",
> > >  	[MHI_PM_STATE_SHUTDOWN_PROCESS] = "SHUTDOWN Process",
> > >  	[MHI_PM_STATE_LD_ERR_FATAL_DETECT] = "Linkdown or Error Fatal Detect",
> > >  };
> > > diff --git a/drivers/bus/mhi/host/internal.h b/drivers/bus/mhi/host/internal.h
> > > index 30ac415a3000..4b6deea17bcd 100644
> > > --- a/drivers/bus/mhi/host/internal.h
> > > +++ b/drivers/bus/mhi/host/internal.h
> > > @@ -88,6 +88,7 @@ enum mhi_pm_state {
> > >  	MHI_PM_STATE_FW_DL_ERR,
> > >  	MHI_PM_STATE_SYS_ERR_DETECT,
> > >  	MHI_PM_STATE_SYS_ERR_PROCESS,
> > > +	MHI_PM_STATE_SYS_ERR_FAIL,
> > >  	MHI_PM_STATE_SHUTDOWN_PROCESS,
> > >  	MHI_PM_STATE_LD_ERR_FATAL_DETECT,
> > >  	MHI_PM_STATE_MAX
> > > @@ -104,14 +105,16 @@ enum mhi_pm_state {
> > >  #define MHI_PM_FW_DL_ERR			BIT(7)
> > >  #define MHI_PM_SYS_ERR_DETECT			BIT(8)
> > >  #define MHI_PM_SYS_ERR_PROCESS			BIT(9)
> > > -#define MHI_PM_SHUTDOWN_PROCESS			BIT(10)
> > > +#define MHI_PM_SYS_ERR_FAIL			BIT(10)
> > > +#define MHI_PM_SHUTDOWN_PROCESS			BIT(11)
> > >  /* link not accessible */
> > > -#define MHI_PM_LD_ERR_FATAL_DETECT		BIT(11)
> > > +#define MHI_PM_LD_ERR_FATAL_DETECT		BIT(12)
> > >  #define MHI_REG_ACCESS_VALID(pm_state)		((pm_state & (MHI_PM_POR | MHI_PM_M0 | \
> > > 						MHI_PM_M2 | MHI_PM_M3_ENTER | MHI_PM_M3_EXIT | \
> > > 						MHI_PM_SYS_ERR_DETECT | MHI_PM_SYS_ERR_PROCESS | \
> > > -						MHI_PM_SHUTDOWN_PROCESS | MHI_PM_FW_DL_ERR)))
> > > +						MHI_PM_SYS_ERR_FAIL | MHI_PM_SHUTDOWN_PROCESS | \
> > > +						MHI_PM_FW_DL_ERR)))
> > >  #define MHI_PM_IN_ERROR_STATE(pm_state)		(pm_state >= MHI_PM_FW_DL_ERR)
> > >  #define MHI_PM_IN_FATAL_STATE(pm_state)		(pm_state == MHI_PM_LD_ERR_FATAL_DETECT)
> > >  #define MHI_DB_ACCESS_VALID(mhi_cntrl)		(mhi_cntrl->pm_state & mhi_cntrl->db_access)
> > > diff --git a/drivers/bus/mhi/host/pm.c b/drivers/bus/mhi/host/pm.c
> > > index a2f2feef1476..d0d033ce9984 100644
> > > --- a/drivers/bus/mhi/host/pm.c
> > > +++ b/drivers/bus/mhi/host/pm.c
> > > @@ -36,7 +36,10 @@
> > >   *     M0 <--> M0
> > >   *     M0 -> FW_DL_ERR
> > >   *     M0 -> M3_ENTER -> M3 -> M3_EXIT --> M0
> > > - * L1: SYS_ERR_DETECT -> SYS_ERR_PROCESS --> POR
> > > + * L1: SYS_ERR_DETECT -> SYS_ERR_PROCESS
> > > + *     SYS_ERR_PROCESS -> SYS_ERR_FAIL
> > > + *     SYS_ERR_FAIL -> SYS_ERR_DETECT
> > > + *     SYS_ERR_PROCESS --> POR
> > >   * L2: SHUTDOWN_PROCESS -> LD_ERR_FATAL_DETECT
> > >   *     SHUTDOWN_PROCESS -> DISABLE
> > >   * L3: LD_ERR_FATAL_DETECT <--> LD_ERR_FATAL_DETECT
> > > @@ -93,7 +96,12 @@ static const struct mhi_pm_transitions dev_state_transitions[] = {
> > >  	},
> > >  	{
> > >  		MHI_PM_SYS_ERR_PROCESS,
> > > -		MHI_PM_POR | MHI_PM_SHUTDOWN_PROCESS |
> > > +		MHI_PM_POR | MHI_PM_SYS_ERR_FAIL | MHI_PM_SHUTDOWN_PROCESS |
> > > +		MHI_PM_LD_ERR_FATAL_DETECT
> > > +	},
> > > +	{
> > > +		MHI_PM_SYS_ERR_FAIL,
> > > +		MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> > >  		MHI_PM_LD_ERR_FATAL_DETECT
> > >  	},
> > >  	/* L2 States */
> > > @@ -629,7 +637,13 @@ static void mhi_pm_sys_error_transition(struct mhi_controller *mhi_cntrl)
> > >  						    !in_reset, timeout);
> > >  	if (!ret || in_reset) {
> > >  		dev_err(dev, "Device failed to exit MHI Reset state\n");
> > > -		goto exit_sys_error_transition;
> > > +		write_lock_irq(&mhi_cntrl->pm_lock);
> > > +		cur_state = mhi_tryset_pm_state(mhi_cntrl,
> > > +						MHI_PM_SYS_ERR_FAIL);
> > > +		write_unlock_irq(&mhi_cntrl->pm_lock);
> > > +		/* Shutdown may have occurred, otherwise cleanup now */
> > > +		if (cur_state != MHI_PM_SYS_ERR_FAIL)
> > > +			goto exit_sys_error_transition;
> > >  	}
> > >  
> > >  	/*
> > > -- 
> > > 2.34.1
> > > 
> > > 
> > 

-- 
மணிவண்ணன் சதாசிவம்
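
A rough sketch of the controller-initiated recovery path mentioned in the
commit message above (illustrative only, not part of this patch:
example_try_recover() and the "shut down then retry once" policy are made up
for the example, while mhi_power_down() and mhi_power_up() are the existing
MHI host APIs):

#include <linux/mhi.h>

/*
 * Once a controller driver decides the device is wedged (for example it
 * observed the "SYS ERROR Failure" pm state, or its own health check timed
 * out), SYS_ERR_FAIL still permits a controller-initiated shutdown, so the
 * driver can tear everything down and try to boot the device from scratch.
 */
static int example_try_recover(struct mhi_controller *mhi_cntrl)
{
	/*
	 * Shut down without waiting on the (unresponsive) device; this
	 * destroys the channel devices and host state.
	 */
	mhi_power_down(mhi_cntrl, false);

	/* Attempt a fresh boot; a failure here leaves the device down */
	return mhi_power_up(mhi_cntrl);
}

How often (or whether) to retry is controller policy; the point is only that
SYS_ERR_FAIL leaves both exits open: handling a newly detected SYSERR, or a
controller-driven shutdown like the one sketched here.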