On 3/6/25 12:32 PM, Jeff Hugo wrote:
From: Jeffrey Hugo <quic_jhugo@xxxxxxxxxxx> mhi_async_power_up() enables IRQs, at which point we can receive a syserr notification from the device. The syserr notification queues a work item that cannot execute until the pm_mutex is released. If we receive a syserr notification at the right time during mhi_async_power_up(), we will fail to initialize the device. The syserr work item will be pending. If mhi_async_power_up() detects the syserr, it will handle it. If the device is in PBL, then the PBL state transition event will be queued, resulting in a work item after the pending syserr work item. Once mhi_async_power_up() releases the pm_mutex the syserr work item can run. It will blindly attempt to reset the MHI state machine, which is the recovery action for syserr. PBL/SBL are not interrupt driven and will ignore the MHI Reset unless syserr is actively advertised. This will cause the syserr work item to timeout waiting for Reset to be cleared, and will leave the host state in syserr processing. The PBL transition work item will then run, and immediately fail because syserr processing is not a valid state for PBL transition. This leaves the device uninitialized. This issue has a fairly unique signature in the kernel log: [ 909.803598] mhi mhi3: Requested to power ON [ 909.803775] Qualcomm Cloud AI 100 0000:36:00.0: Fatal error received from device. Attempting to recover [ 909.803945] mhi mhi3: Power on setup success [ 911.808444] mhi mhi3: Device failed to exit MHI Reset state [ 911.808448] mhi mhi3: Device MHI is not in valid state We cannot remove the syserr handling from mhi_async_power_up() because the device may be in the syserr state, but we missed the notification as the irq was fired before irqs were enabled. We also can't queue the syserr work item from mhi_async_power_up() if syserr is detected because that may result in a duplicate work item, and cause the same issue since the duplicate item will blindly issue MHI Reset even if syserr is no longer active. Instead, add a check in the syserr work item to make sure that the device is in the syserr state if the device is in the PBL or SBL EEs. Signed-off-by: Jeffrey Hugo <quic_jhugo@xxxxxxxxxxx> Signed-off-by: Jeff Hugo <jeff.hugo@xxxxxxxxxxxxxxxx>
Reviewed-by: Troy Hanson <quic_thanson@xxxxxxxxxxx>