On 1/22/2024 9:09 PM, Manivannan Sadhasivam wrote:
On Mon, Jan 22, 2024 at 04:09:53PM +0800, Baochen Qiang wrote:
On 1/22/2024 2:24 PM, Manivannan Sadhasivam wrote:
On Thu, Jan 04, 2024 at 11:39:12AM +0530, Manivannan Sadhasivam wrote:
+ Can, Qiang
[...]
To me it all sounds like the probe deferral is not handled properly in the
mac80211 stack. As you mentioned in the commit message, dpm_prepare() blocks
probing of devices. It gets unblocked and triggered in dpm_complete():
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/base/power/main.c#n1131
So if mac80211/ath11k cannot probe the devices at the dpm_complete() stage, then
it is definitely an issue that needs to be fixed properly.
To clarify, ath11k CAN probe the devices at the dpm_complete() stage. The
problem is that the kernel does not wait for all probes to finish, so user
space applications are likely to fail because they get thawed BEFORE WLAN
is ready.
Hmm. Please give me some time to reproduce this issue locally. I will get back
to this thread with my analysis.
We reproduced the issue with the help of the PCIe team (thanks Can). What we
found out was that, during the resume from hibernation, the failure happens in
ath11k_core_resume(). Precisely here:
https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/tree/drivers/net/wireless/ath/ath11k/core.c?h=ath11k-hibernation-support#n850
This code waits for the QMI messages to arrive and eventually times out. But the
impression I got from the start was that the mhi_power_up() always fails during
resume. In our investigation, we confirmed that the failure is not happening at
the MHI level.
No, mhi_power_up() never fails, as it only downloads PBL and SBL and waits
for mission mode; no MHI device is created, hence it is not affected by the
deferred probe. However, in addition to PBL/SBL, ath11k also needs to download
m3.bin, board.bin and regdb.bin. Those files are part of the WLAN firmware and
are downloaded via QMI messages. After mhi_power_up() succeeds,
ath11k_core_resume() waits for QMI to download those files. As you know, QMI
relies on MHI channels, and these channels are managed by qcom_mhi_qrtr_driver.
Since device probing is deferred, qcom_mhi_qrtr_driver has no chance to run
at this stage. As a result, ath11k_core_resume() times out.
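To make the dependency chain above concrete, here is a minimal userspace sketch. It is not kernel code: struct pm_model, the model_*() helpers and MODEL_ETIMEDOUT are all hypothetical names invented for illustration, and the tick-based loop only stands in for the real QMI wait. It assumes nothing beyond what is described above: mhi_power_up() does not need a driver probe, while the QMI download does.

```c
#include <stdbool.h>

#define MODEL_ETIMEDOUT 110 /* stand-in for the kernel's ETIMEDOUT */

/* Hypothetical userspace model of the resume path; none of this is kernel code. */
struct pm_model {
    bool probing_blocked;  /* true between the dpm_prepare() and dpm_complete() stages */
    bool qrtr_probed;      /* qcom_mhi_qrtr_driver has bound to the MHI channels */
    bool qmi_files_loaded; /* m3.bin, board.bin and regdb.bin downloaded via QMI */
};

/* Models mhi_power_up(): PBL/SBL download and mission mode need no driver probe. */
int model_mhi_power_up(struct pm_model *m)
{
    (void)m;
    return 0;
}

/* The qrtr driver can only bind while probing is not blocked. */
void model_try_probe_qrtr(struct pm_model *m)
{
    if (!m->probing_blocked)
        m->qrtr_probed = true;
}

/* Models ath11k_core_resume(): wait a bounded number of ticks for QMI. */
int model_core_resume(struct pm_model *m, int timeout_ticks)
{
    int ret = model_mhi_power_up(m);

    if (ret)
        return ret;

    for (int t = 0; t < timeout_ticks; t++) {
        model_try_probe_qrtr(m);
        if (m->qrtr_probed) {
            m->qmi_files_loaded = true; /* QMI has its channels, download completes */
            return 0;
        }
    }
    return -MODEL_ETIMEDOUT; /* QMI never came up within the wait */
}
```

With probing_blocked set, model_core_resume() returns -MODEL_ETIMEDOUT no matter how long the wait is, even though the modelled mhi_power_up() succeeded. With probing_blocked cleared, it returns 0 on the first tick.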
Thanks for the info, this clarifies the issue in detail.
I'm not pointing fingers here, but trying to understand why you can't fix
ath11k_core_resume() to not time out. IMO this timeout should be handled as a
deferral case.
Let's see what happens if we do it in a deferral way:
1. In ath11k_core_resume() we return success directly without waiting for
QMI to download the other firmware files.
2. Kernel unblocks device probe and schedules a work item to trigger all
deferred probing. As a result MHI devices are probed by qcom_mhi_qrtr_driver
and finally QMI is online.
3. Kernel continues to resume and wakes up userspace applications.
4. ath11k gets the message, either by kernel PM notification or something
else, that QMI is ready and then downloads other firmware files.
What happens if userspace applications or the network stack immediately
initiate some WLAN request after resume? Can ath11k handle such a request? The
answer is, most likely, no, because there is no guarantee that QMI finishes
downloading before those requests.
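The window described in the steps above can be sketched as a small state machine. Again this is only a userspace illustration under stated assumptions: struct wlan_model, the model_*() functions and MODEL_ENOTREADY are hypothetical, and how ath11k would actually react to a request in that window is not specified here.

```c
#include <stdbool.h>

#define MODEL_ENOTREADY 1 /* hypothetical error code, for illustration only */

enum wlan_state {
    WLAN_SUSPENDED,
    WLAN_RESUMED_NO_FW, /* resume returned success, QMI files not yet downloaded */
    WLAN_READY,         /* state before suspend fully restored */
};

struct wlan_model {
    enum wlan_state state;
};

/* Step 1: resume returns success without waiting for the QMI download. */
int model_resume_deferred(struct wlan_model *w)
{
    w->state = WLAN_RESUMED_NO_FW;
    return 0;
}

/* Step 4: QMI is online and the remaining firmware files are downloaded. */
void model_qmi_ready(struct wlan_model *w)
{
    if (w->state == WLAN_RESUMED_NO_FW)
        w->state = WLAN_READY;
}

/* A request from userspace or the network stack after step 3 (thaw). */
int model_user_request(const struct wlan_model *w)
{
    return w->state == WLAN_READY ? 0 : -MODEL_ENOTREADY;
}
```

In this model a request issued between model_resume_deferred() and model_qmi_ready() fails, and the same request succeeds afterwards. Nothing in the sequence orders step 4 before the first request, which is the gap in question.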
What will happen to userspace if ath11k returns an error like -EBUSY or
something? Will the netdev completely go away?
It depends and varies from application to application; we can't make any
assumption.
Besides, it doesn't make sense to return -EBUSY or something like that
if ath11k has already returned success during resume. A WLAN driver is
supposed to finish everything, or at least get back to its pre-suspend
state, in the resume callback. If it can't, it should report the error.
- Mani