On Thu, Mar 06, 2025 at 10:32:26AM -0700, Jeff Hugo wrote: > From: Jeffrey Hugo <quic_jhugo@xxxxxxxxxxx> > > mhi_async_power_up() enables IRQs, at which point we can receive a syserr > notification from the device. The syserr notification queues a work item > that cannot execute until the pm_mutex is released. > > If we receive a syserr notification at the right time during > mhi_async_power_up(), we will fail to initialize the device. > > The syserr work item will be pending. If mhi_async_power_up() detects the > syserr, it will handle it. If the device is in PBL, then the PBL state > transition event will be queued, resulting in a work item after the > pending syserr work item. Once mhi_async_power_up() releases the pm_mutex > the syserr work item can run. It will blindly attempt to reset the MHI > state machine, which is the recovery action for syserr. PBL/SBL are not > interrupt driven and will ignore the MHI Reset unless syserr is actively > advertised. This will cause the syserr work item to timeout waiting for > Reset to be cleared, and will leave the host state in syserr processing. > The PBL transition work item will then run, and immediately fail because > syserr processing is not a valid state for PBL transition. > > This leaves the device uninitialized. > > This issue has a fairly unique signature in the kernel log: > > [ 909.803598] mhi mhi3: Requested to power ON > [ 909.803775] Qualcomm Cloud AI 100 0000:36:00.0: Fatal error received from device. Attempting to recover > [ 909.803945] mhi mhi3: Power on setup success > [ 911.808444] mhi mhi3: Device failed to exit MHI Reset state > [ 911.808448] mhi mhi3: Device MHI is not in valid state > > We cannot remove the syserr handling from mhi_async_power_up() because the > device may be in the syserr state, but we missed the notification as the > irq was fired before irqs were enabled. We also can't queue the syserr > work item from mhi_async_power_up() if syserr is detected because that may > result in a duplicate work item, and cause the same issue since the > duplicate item will blindly issue MHI Reset even if syserr is no longer > active. > > Instead, add a check in the syserr work item to make sure that the device > is in the syserr state if the device is in the PBL or SBL EEs. > Don't we need a Fixes tag? > Signed-off-by: Jeffrey Hugo <quic_jhugo@xxxxxxxxxxx> > Signed-off-by: Jeff Hugo <jeff.hugo@xxxxxxxxxxxxxxxx> > --- > drivers/bus/mhi/host/pm.c | 18 +++++++++++++++++- > 1 file changed, 17 insertions(+), 1 deletion(-) > > diff --git a/drivers/bus/mhi/host/pm.c b/drivers/bus/mhi/host/pm.c > index 11c0e751f223..3dff0f932726 100644 > --- a/drivers/bus/mhi/host/pm.c > +++ b/drivers/bus/mhi/host/pm.c > @@ -602,6 +602,7 @@ static void mhi_pm_sys_error_transition(struct mhi_controller *mhi_cntrl) > struct mhi_cmd *mhi_cmd; > struct mhi_event_ctxt *er_ctxt; > struct device *dev = &mhi_cntrl->mhi_dev->dev; > + bool reset_device = false; > int ret, i; > > dev_dbg(dev, "Transitioning from PM state: %s to: %s\n", > @@ -630,8 +631,23 @@ static void mhi_pm_sys_error_transition(struct mhi_controller *mhi_cntrl) > /* Wake up threads waiting for state transition */ > wake_up_all(&mhi_cntrl->state_event); > > - /* Trigger MHI RESET so that the device will not access host memory */ > + /* > + * Trigger MHI RESET so that the device will not access host memory. Move this comment before 'if (reset_device)'. > + * If the device is in PBL or SBL, it will only respond to RESET if > + * the device is in SYSERR state. SYSERR might already be cleared > + * at this point. > + */ > if (MHI_REG_ACCESS_VALID(prev_state)) { > + enum mhi_state cur_statemachine_state = mhi_get_mhi_state(mhi_cntrl); s/cur_statemachine_state/cur_state - Mani -- மணிவண்ணன் சதாசிவம்