On Wed, Jan 22, 2025 at 04:24:27PM +0100, Johan Hovold wrote:
> On Wed, Jan 08, 2025 at 07:09:28PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
> > From: Manivannan Sadhasivam <manivannan.sadhasivam@xxxxxxxxxx>
> >
> > Currently, in mhi_pci_runtime_resume(), if the resume fails, recovery_work
> > is started asynchronously and success is returned. But this doesn't align
> > with what PM core expects as documented in
> > Documentation/power/runtime_pm.rst:
>
> > Cc: stable@xxxxxxxxxxxxxxx # 5.13
> > Reported-by: Johan Hovold <johan@xxxxxxxxxx>
> > Closes: https://lore.kernel.org/mhi/Z2PbEPYpqFfrLSJi@xxxxxxxxxxxxxxxxxxxx
> > Fixes: d3800c1dce24 ("bus: mhi: pci_generic: Add support for runtime PM")
> > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@xxxxxxxxxx>
>
> Reasoning above makes sense, and I do indeed see resume taking five
> seconds longer with this patch as Loic suggested it would.

I forgot to mention the following warnings that now show up when system
resume succeeds.
Recovery was run also before this patch but the "parent mhi0 should not be
sleeping" warnings are new:

[ 68.753288] qcom_mhi_qrtr mhi0_IPCR: failed to prepare for autoqueue transfer -22
[ 68.761109] qcom_mhi_qrtr mhi0_IPCR: PM: dpm_run_callback(): qcom_mhi_qrtr_pm_resume_early [qrtr_mhi] returns -22
[ 68.771804] qcom_mhi_qrtr mhi0_IPCR: PM: failed to resume early: error -22
[ 68.795053] mhi-pci-generic 0005:01:00.0: mhi_pci_resume
[ 68.800709] mhi-pci-generic 0005:01:00.0: mhi_pci_runtime_resume
[ 68.800794] mhi mhi0: Resuming from non M3 state (RESET)
[ 68.800804] mhi-pci-generic 0005:01:00.0: failed to resume device: -22
[ 68.819517] mhi-pci-generic 0005:01:00.0: device recovery started
[ 68.819532] mhi-pci-generic 0005:01:00.0: __mhi_power_down
[ 68.819543] mhi-pci-generic 0005:01:00.0: __mhi_power_down - pm mutex taken
[ 68.819554] mhi-pci-generic 0005:01:00.0: __mhi_power_down - pm lock taken
[ 68.820060] wwan wwan0: port wwan0qcdm0 disconnected
[ 68.824839] nvme nvme0: 12/0/0 default/read/poll queues
[ 68.857908] wwan wwan0: port wwan0mbim0 disconnected
[ 68.864012] wwan wwan0: port wwan0qmi0 disconnected
[ 68.943307] mhi-pci-generic 0005:01:00.0: __mhi_power_down - returns
[ 68.956253] mhi mhi0: Requested to power ON
[ 68.960753] mhi mhi0: Power on setup success
[ 68.965262] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - wait event timeout_ms = 8000
[ 73.183086] mhi mhi0: Wait for device to enter SBL or Mission mode
[ 73.653462] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - wait event returns, ret = 0
[ 73.653752] mhi mhi0_DIAG: PM: parent mhi0 should not be sleeping
[ 73.661955] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - returns
[ 73.668461] mhi mhi0_MBIM: PM: parent mhi0 should not be sleeping
[ 73.674950] mhi-pci-generic 0005:01:00.0: Recovery completed
[ 73.681428] mhi mhi0_QMI: PM: parent mhi0 should not be sleeping
[ 74.315919] OOM killer enabled.
[ 74.316475] wwan wwan0: port wwan0qcdm0 attached
[ 74.319206] Restarting tasks ...
[ 74.322825] done.
[ 74.322870] random: crng reseeded on system resumption
[ 74.325956] wwan wwan0: port wwan0mbim0 attached
[ 74.334467] wwan wwan0: port wwan0qmi0 attached

> Unfortunately, something else is broken as the recovery code now
> deadlocks again when the modem fails to resume (with both patches
> applied):
>
> [ 729.833701] PM: suspend entry (deep)
> [ 729.841377] Filesystems sync: 0.000 seconds
> [ 729.867672] Freezing user space processes
> [ 729.869494] Freezing user space processes completed (elapsed 0.001 seconds)
> [ 729.869499] OOM killer disabled.
> [ 729.869501] Freezing remaining freezable tasks
> [ 729.870882] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
> [ 730.184254] mhi-pci-generic 0005:01:00.0: mhi_pci_runtime_resume
> [ 730.190643] mhi mhi0: Resuming from non M3 state (SYS ERROR)
> [ 730.196587] mhi-pci-generic 0005:01:00.0: failed to resume device: -22
> [ 730.203412] mhi-pci-generic 0005:01:00.0: device recovery started
>
> I've reproduced this three times in three different paths (runtime
> resume before suspend; runtime resume during suspend; and during system
> resume).
>
> I didn't try to figure what causes the deadlock this time (and lockdep
> does not trigger), but you should be able to reproduce this by
> instrumenting a resume failure.

Johan