On 11/29/2023 1:22 AM, Luca Ceresoli wrote:
Hello,
For several weeks I have been investigating a sporadic reboot failure on a
custom board based on i.MX6Q. There is an ATH9K Wi-Fi card connected
over PCIe, and the main suspect is the ath9k driver.
Anybody aware of this kind of bug with ath9k?
Some details about my tests follow.
This is on mainline v6.6 Linux, with only the board dts and a defconfig
added. The board dts itself is based on imx6q.dtsi and among others it
adds:
&pcie {
	pinctrl-names = "default";
	pinctrl-0 = <&pinctrl_pcie>;
	reset-gpio = <&gpio2 20 GPIO_ACTIVE_LOW>;
	status = "okay";
};
and:
&iomuxc {
	/* ... */
	imx6qdl-sabresd {
		/* ... */
		pinctrl_pcie: pciegrp {
			fsl,pins = <
				MX6QDL_PAD_EIM_A18__GPIO2_IO20 0x1b0b0
			>;
		};
		/* ... */
	};
};
Reboot usually works fine, but fails randomly in 1-5% of the
cases. The symptom is that the console stops producing any messages
at some random point in the shutdown sequence, even in the middle of a
line.
After about 7000 reboot attempts with different configurations it is
clear that enabling or disabling CONFIG_ATH9K is what makes the
difference:
1. kernels with CONFIG_ATH9K=n never fail
2. kernels with CONFIG_ATH9K=y do fail
Kernels built with CONFIG_ATH9K=y fail even with all optional
CONFIG_ATH9K* options (rfkill, pcoem, btcoex and no_eeprom) disabled.
Similarly:
1. removing pcie from the device tree makes reboot work
2. with pcie left in the device tree and all the peripherals
   not required for booting removed, reboot still fails
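
A rough way to judge how many clean reboots are needed before "never
fails" can be trusted: if the true failure rate were p per reboot and
reboots were independent (an assumption, not something measured here),
the chance of seeing zero failures in n attempts is (1 - p)^n. The sketch
below computes that probability and the number of attempts needed for a
given confidence. The function names are hypothetical.

```python
import math


def prob_no_failure(p: float, n: int) -> float:
    """Probability of zero failures in n independent reboots,
    each failing with probability p."""
    return (1.0 - p) ** n


def reboots_needed(p: float, confidence: float) -> int:
    """Smallest n for which a bug with per-reboot failure rate p
    would show up at least once with the given confidence."""
    # Solve (1 - p)^n <= 1 - confidence for n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))
```

Calling reboots_needed(0.01, 0.99) gives the number of failure-free
reboots a configuration needs before a 1% failure rate can be ruled out
with 99% confidence under these assumptions. The low end of the 1-5%
range dominates the test time.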
On top of v6.6 I have applied all the potentially related commits from
master that appear as of now (8 in total):
  git log --oneline --reverse --format=%H v6.6..origin/master -- \
      drivers/net/wireless/ath/*.[ch] drivers/net/wireless/ath/ath9k/ \
      | xargs git cherry-pick
and reboot still fails.
I have tested these mainline kernel versions, with no result:
v6.1.60, v5.15.137, v5.10.199, v5.10.
A first look at the ath9k driver code did not show anything obviously
wrong.
Any clues about how to further investigate would be very welcome.
I am obviously available to provide more info.
Do you have a reboot log with "initcall_debug debug" set on the kernel
command line and if so, does it always point to the PCI bus shutting
down the device drivers, pcie ports and ultimately the root complex?
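
If you capture the serial console to a file, something like the
following could tell you which device was being shut down when the
output stopped. This is only a sketch: it assumes the driver core logs
one line per device ending in ": shutdown" when initcall_debug is set,
optionally preceded by a printk timestamp, and the function name is made
up. Adjust the pattern to whatever your log actually contains.

```python
import re
from typing import Iterable, Optional

# Assumed line shape: optional "[ timestamp]" prefix, then
# "<driver> <device>: shutdown". Truncated lines will not match.
SHUTDOWN_RE = re.compile(
    r"^(?:\[\s*\d+\.\d+\]\s*)?(?P<dev>.+?): shutdown\s*$"
)


def last_shutdown_device(lines: Iterable[str]) -> Optional[str]:
    """Return the device named in the last complete "shutdown" line,
    or None if no such line is found."""
    last = None
    for line in lines:
        m = SHUTDOWN_RE.match(line.rstrip("\r\n"))
        if m:
            last = m.group("dev")
    return last
```

Running this over the logs of several failed reboots (for example by
passing an open file object) and comparing the results would show
whether the hang always follows the same device.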
We have seen something similar before with ath10k_pci and our
pcie-brcmstb driver, which eventually turned out to be the result of
incorrect assumptions made while implementing the platform_driver::shutdown
routine. There was a hard hang in ath10k_remove(). I do not recall the
details, but we were definitely doing something improper there.
imx6_pcie_shutdown() seems to be much simpler, but my first guess would be
there.
Hope this helps.
--
Florian