On Tue, Jan 23, 2024 at 09:44:11AM +0800, Baochen Qiang wrote:
>
>
> On 1/22/2024 9:09 PM, Manivannan Sadhasivam wrote:
> > On Mon, Jan 22, 2024 at 04:09:53PM +0800, Baochen Qiang wrote:
> > >
> > >
> > > On 1/22/2024 2:24 PM, Manivannan Sadhasivam wrote:
> > > > On Thu, Jan 04, 2024 at 11:39:12AM +0530, Manivannan Sadhasivam wrote:
> > > >
> > > > + Can, Qiang
> > > >
> > > > [...]
> > > >
> > > > > > > To me it all sounds like the probe deferral is not handled properly in mac80211
> > > > > > > stack. As you mentioned in the commit message that the dpm_prepare() blocks
> > > > > > > probing of devices. It gets unblocked and triggered in dpm_complete():
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/base/power/main.c#n1131
> > > > > > >
> > > > > > > So if mac80211/ath11k cannot probe the devices at the dpm_complete() stage, then
> > > > > > > it is definitely an issue that needs to be fixed properly.
> > > > > > To clarify, ath11k CAN probe the devices at dpm_complete() stage. The
> > > > > > problem is kernel does not wait for all probes to finish, and in that way we
> > > > > > will face the issue that user space applications are likely to fail because
> > > > > > they get thawed BEFORE WLAN is ready.
> > > > > >
> > > > >
> > > > > Hmm. Please give me some time to reproduce this issue locally. I will get back
> > > > > to this thread with my analysis.
> > > > >
> > > >
> > > > We reproduced the issue with the help of PCIe team (thanks Can). What we found
> > > > out was, during the resume from hibernation the failure happens in
> > > > ath11k_core_resume(). Precisely here:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/tree/drivers/net/wireless/ath/ath11k/core.c?h=ath11k-hibernation-support#n850
> > > >
> > > > This code waits for the QMI messages to arrive and eventually times out. But the
> > > > impression I got from the start was that the mhi_power_up() always fails during
> > > > resume.
> > > > In our investigation, we confirmed that the failure is not happening at
> > > > the MHI level.
> > > No, mhi_power_up() never fails as it only downloads PBL, SBL and waits
> > > for mission mode, no MHI device created hence not affected by the deferred
> > > probe. However in addition to PBL/SBL, ath11k also needs to download m3.bin,
> > > board.bin and regdb.bin. Those files are part of WLAN firmware and are
> > > downloaded via QMI messages. After mhi_power_up() succeeds
> > > ath11k_core_resume() waits for QMI downloading those files. As you know QMI
> > > relies on MHI channels, these channels are managed by qcom_mhi_qrtr_driver.
> > > Since device probing is deferred, qcom_mhi_qrtr_driver has no chance to run
> > > at this stage. As a result ath11k_core_resume() times out.
> > >
> >
> > Thanks for the info, this clarifies the issue in detail.
> >
> > > >
> > > > I'm not pointing fingers here, but trying to understand why can't you fix
> > > > ath11k_core_resume() to not time out? IMO this timeout should be handled as a
> > > > deferral case.
> > > Let's see what happens if we do it in a deferral way:
> > > 1. In ath11k_core_resume() we return success directly without waiting for
> > > QMI downloading other firmware files.
> > > 2. Kernel unblocks device probe and schedules a work item to trigger all
> > > deferred probing. As a result MHI devices are probed by qcom_mhi_qrtr_driver
> > > and finally QMI is online.
> > > 3. Kernel continues to resume and wake up userspace applications.
> > > 4. ath11k gets the message, either by kernel PM notification or something
> > > else, that QMI is ready and then downloads other firmware files.
> > >
> > > What happens if userspace applications or network stack immediately initiate
> > > some WLAN request after resume back? Can ath11k handle such requests? The
> > > answer is, most likely, no. Because there is no guarantee that QMI finishes
> > > downloading before those requests.
> > >
> >
> > What will happen to userspace if ath11k returns an error like -EBUSY or
> > something? Will the netdev completely go away?
> It depends, and varies from application to application, we can't make the
> assumption.
>
> Besides, it doesn't make sense to return -EBUSY or something like that, if
> ath11k returns success during resume. A WLAN driver is supposed to finish
> everything, at least get back to the state before suspend, in the resume
> callback. If it couldn't, report the error.
>

Ok. So I am getting the feeling that we need to talk to the PM people to get a
proper solution. Clearly fixing the MHI code is not the right thing to do. We
might need a separate callback that gets registered by the drivers like ath11k
to wait for the dependency drivers to get probed.

Can you initiate such a discussion? You can write to linux-pm@xxxxxxxxxxxxxxx,
"Rafael J. Wysocki" <rafael@xxxxxxxxxx> and Pavel Machek <pavel@xxxxxx>.

- Mani

--
மணிவண்ணன் சதாசிவம்