On Wed, Jan 22, 2025 at 04:11:41PM +0100, Johan Hovold wrote: > On Wed, Jan 08, 2025 at 07:09:27PM +0530, Manivannan Sadhasivam via B4 Relay wrote: > > From: Manivannan Sadhasivam <manivannan.sadhasivam@xxxxxxxxxx> > > > > There are multiple places from where the recovery work gets scheduled > > asynchronously. Also, there are multiple places where the caller waits > > synchronously for the recovery to be completed. One such place is during > > the PM shutdown() callback. > > > > If the device is not alive during recovery_work, it will try to reset the > > device using pci_reset_function(). This function internally will take the > > device_lock() first before resetting the device. By this time, if the lock > > has already been acquired, then recovery_work will get stalled while > > waiting for the lock. And if the lock was already acquired by the caller > > which waits for the recovery_work to be completed, it will lead to > > deadlock. > > > > This is what happened on the X1E80100 CRD device when the device died > > before shutdown() callback. Driver core calls the driver's shutdown() > > callback while holding the device_lock() leading to deadlock. > > > > And this deadlock scenario can occur on other paths as well, like during > > the PM suspend() callback, where the driver core would hold the > > device_lock() before calling driver's suspend() callback. And if the > > recovery_work was already started, it could lead to deadlock. This is also > > observed on the X1E80100 CRD. > > > > So to fix both issues, use pci_try_reset_function() in recovery_work. This > > function first checks for the availability of the device_lock() before > > trying to reset the device. If the lock is available, it will acquire it > > and reset the device. Otherwise, it will return -EAGAIN. If that happens, > > recovery_work will fail with the error message "Recovery failed" as not > > much could be done. > > I can confirm that this patch (alone) fixes the deadlock on shutdown > and suspend as expected, but it does leave the system state that blocks > further suspend (this is with patches that tear down the PCI link). Yeah, we wouldn't know when not to return an error to unblock the suspend. So this is acceptable. > Looks like the modem is also hosed. > > [ 267.454616] mhi-pci-generic 0005:01:00.0: mhi_pci_runtime_resume > [ 267.461165] mhi mhi0: Resuming from non M3 state (SYS ERROR) > [ 267.467102] mhi-pci-generic 0005:01:00.0: failed to resume device: -22 > [ 267.473968] mhi-pci-generic 0005:01:00.0: device recovery started > [ 267.477460] mhi-pci-generic 0005:01:00.0: mhi_pci_suspend > [ 267.480331] mhi-pci-generic 0005:01:00.0: __mhi_power_down > [ 267.485917] mhi-pci-generic 0005:01:00.0: mhi_pci_runtime_suspend > [ 267.498339] mhi-pci-generic 0005:01:00.0: __mhi_power_down - pm mutex taken > [ 267.505618] mhi-pci-generic 0005:01:00.0: __mhi_power_down - pm lock taken > [ 267.513372] wwan wwan0: port wwan0qcdm0 disconnected > [ 267.519556] wwan wwan0: port wwan0mbim0 disconnected > [ 267.525544] wwan wwan0: port wwan0qmi0 disconnected > [ 267.573773] mhi-pci-generic 0005:01:00.0: __mhi_power_down - returns > [ 267.591199] mhi mhi0: Requested to power ON > [ 267.914688] mhi mhi0: Power on setup success > [ 267.914875] mhi mhi0: Wait for device to enter SBL or Mission mode > [ 267.919179] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - wait event timeout_ms = 8000 > [ 276.189399] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - wait event returns, ret = -110 > [ 276.198240] mhi-pci-generic 0005:01:00.0: __mhi_power_down > [ 276.203990] mhi-pci-generic 0005:01:00.0: __mhi_power_down - pm mutex taken > [ 276.211269] mhi-pci-generic 0005:01:00.0: __mhi_power_down - pm lock taken > [ 276.220024] mhi-pci-generic 0005:01:00.0: __mhi_power_down - returns > [ 276.226680] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - returns > [ 276.233417] mhi-pci-generic 0005:01:00.0: mhi_pci_recovery_work - mhi unprepare after power down > [ 276.242604] mhi-pci-generic 0005:01:00.0: mhi_pci_recovery_work - pci reset > [ 276.249881] mhi-pci-generic 0005:01:00.0: Recovery failed > [ 276.255536] mhi-pci-generic 0005:01:00.0: mhi_pci_recovery_work - returns > [ 276.328878] qcom-pcie 1bf8000.pci: qcom_pcie_suspend_noirq > [ 276.368851] qcom-pcie 1c00000.pci: qcom_pcie_suspend_noirq > [ 276.477900] qcom-pcie 1c00000.pci: Failed to enter L2/L3 > [ 276.483535] qcom-pcie 1c00000.pci: PM: dpm_run_callback(): genpd_suspend_noirq returns -110 > [ 276.492292] qcom-pcie 1c00000.pci: PM: failed to suspend noirq: error -110 > [ 276.500218] qcom-pcie 1bf8000.pci: qcom_pcie_resume_noirq > [ 276.721401] qcom-pcie 1bf8000.pci: PCIe Gen.4 x4 link up > [ 276.730639] PM: noirq suspend of devices failed > [ 276.818943] mhi-pci-generic 0005:01:00.0: mhi_pci_resume > [ 276.824582] mhi-pci-generic 0005:01:00.0: mhi_pci_runtime_resume > > Still better than hanging the machine, but perhaps not much point in > having recovery code that can't recover the device. > Unfortunately, we cannot know if we could not recover the device. > We obviously need to track down what is causing the modem to fail to > resume since 6.13-rc1 too. > Yeah, this is worth tracing down. > > Cc: stable@xxxxxxxxxxxxxxx # 5.12 > > Reported-by: Johan Hovold <johan@xxxxxxxxxx> > > Closes: https://lore.kernel.org/mhi/Z1me8iaK7cwgjL92@xxxxxxxxxxxxxxxxxxxx > > Reviewed-by: Johan Hovold <johan+linaro@xxxxxxxxxx> > Tested-by: Johan Hovold <johan+linaro@xxxxxxxxxx> > > And since I've spent way to much time debugging this and provided the > analysis of the deadlock: > > Analyzed-by: Johan Hovold <johan@xxxxxxxxxx> > Link: https://lore.kernel.org/mhi/Z2KKjWY2mPen6GPL@xxxxxxxxxxxxxxxxxxxx/ > Sure. > > Fixes: 7389337f0a78 ("mhi: pci_generic: Add suspend/resume/recovery procedure") > > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@xxxxxxxxxx> > > --- > > drivers/bus/mhi/host/pci_generic.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/bus/mhi/host/pci_generic.c b/drivers/bus/mhi/host/pci_generic.c > > index 07645ce2119a..e92df380c785 100644 > > --- a/drivers/bus/mhi/host/pci_generic.c > > +++ b/drivers/bus/mhi/host/pci_generic.c > > @@ -1040,7 +1040,7 @@ static void mhi_pci_recovery_work(struct work_struct *work) > > err_unprepare: > > mhi_unprepare_after_power_down(mhi_cntrl); > > err_try_reset: > > - if (pci_reset_function(pdev)) > > + if (pci_try_reset_function(pdev)) > > dev_err(&pdev->dev, "Recovery failed\n"); > > Perhaps you should log the returned error here as a part of this patch > so we can tell when the recovery code failed due to the device lock > being held. > Makes sense. Added the errno to the log and applied to patch to mhi/next with your tags. Thanks a lot! - Mani -- மணிவண்ணன் சதாசிவம்