Pali Rohár <pali@xxxxxxxxxx> writes:

> On Wednesday 28 October 2020 18:16:26 Bjorn Helgaas wrote:
>> [+cc Pali, Marek, Thomas, Jason]
>>
>> On Wed, Oct 28, 2020 at 04:40:00PM +0000, ™֟☻̭҇ Ѽ ҉ ® wrote:
>> > On 28/10/2020 16:08, Toke Høiland-Jørgensen wrote:
>> > > Bjorn Helgaas <helgaas@xxxxxxxxxx> writes:
>> > > > On Wed, Oct 28, 2020 at 02:36:13PM +0100, Toke Høiland-Jørgensen wrote:
>> > > > > Toke Høiland-Jørgensen <toke@xxxxxxxxxx> writes:
>> > > > > > Bjorn Helgaas <helgaas@xxxxxxxxxx> writes:
>> > > > > >
>> > > > > > > [+cc vtolkm]
>> > > > > > >
>> > > > > > > On Tue, Oct 27, 2020 at 04:43:20PM +0100, Toke Høiland-Jørgensen wrote:
>> > > > > > > > Hi everyone
>> > > > > > > >
>> > > > > > > > I'm trying to get a mainline kernel to run on my Turris Omnia, and am
>> > > > > > > > having some trouble getting the PCI bus to work correctly. Specifically,
>> > > > > > > > I'm running a 5.10-rc1 kernel (torvalds/master as of this moment), with
>> > > > > > > > the resource request fix[0] applied on top.
>> > > > > > > >
>> > > > > > > > The kernel boots fine, and the patch in [0] makes the PCI devices show
>> > > > > > > > up. But I'm still getting initialisation errors like these:
>> > > > > > > >
>> > > > > > > > [ 1.632709] pci 0000:01:00.0: BAR 0: error updating (0xe0000004 != 0xffffffff)
>> > > > > > > > [ 1.632714] pci 0000:01:00.0: BAR 0: error updating (high 0x000000 != 0xffffffff)
>> > > > > > > > [ 1.632745] pci 0000:02:00.0: BAR 0: error updating (0xe0200004 != 0xffffffff)
>> > > > > > > > [ 1.632750] pci 0000:02:00.0: BAR 0: error updating (high 0x000000 != 0xffffffff)
>> > > > > > > >
>> > > > > > > > and the WiFi drivers fail to initialise with what appears to me to be
>> > > > > > > > errors related to the bus rather than to the drivers themselves:
>> > > > > > > >
>> > > > > > > > [ 3.509878] ath: phy0: Mac Chip Rev 0xfffc0.f is not supported by this driver
>> > > > > > > > [ 3.517049] ath: phy0: Unable to initialize hardware; initialization status: -95
>> > > > > > > > [ 3.524473] ath9k 0000:01:00.0: Failed to initialize device
>> > > > > > > > [ 3.530081] ath9k: probe of 0000:01:00.0 failed with error -95
>> > > > > > > > [ 3.536012] ath10k_pci 0000:02:00.0: of_irq_parse_pci: failed with rc=134
>> > > > > > > > [ 3.543049] pci 0000:00:02.0: enabling device (0140 -> 0142)
>> > > > > > > > [ 3.548735] ath10k_pci 0000:02:00.0: can't change power state from D3hot to D0 (config space inaccessible)
>> > > > > > > > [ 3.588592] ath10k_pci 0000:02:00.0: failed to wake up device : -110
>> > > > > > > > [ 3.595098] ath10k_pci: probe of 0000:02:00.0 failed with error -110
>> > > > > > > >
>> > > > > > > > lspci looks OK, though:
>> > > > > > > >
>> > > > > > > > # lspci
>> > > > > > > > 00:01.0 PCI bridge: Marvell Technology Group Ltd. Device 6820 (rev 04)
>> > > > > > > > 00:02.0 PCI bridge: Marvell Technology Group Ltd. Device 6820 (rev 04)
>> > > > > > > > 00:03.0 PCI bridge: Marvell Technology Group Ltd. Device 6820 (rev 04)
>> > > > > > > > 01:00.0 Network controller: Qualcomm Atheros AR9287 Wireless Network Adapter (PCI-Express) (rev 01)
>> > > > > > > > 02:00.0 Network controller: Qualcomm Atheros QCA986x/988x 802.11ac Wireless Network Adapter (rev ff)
>> > > > > > > >
>> > > > > > > > Does anyone have any clue what could be going on here? Is this a bug, or
>> > > > > > > > did I miss something in my config or other initialisation? I've tried
>> > > > > > > > with both the stock u-boot distributed with the board, and with an
>> > > > > > > > upstream u-boot from latest master; doesn't seem to make any difference.
>> > > > > > > Can you try turning off CONFIG_PCIEASPM? We had a similar recent
>> > > > > > > report at https://bugzilla.kernel.org/show_bug.cgi?id=209833 but I
>> > > > > > > don't think we have a fix yet.
>> > > > > > Yes! Turning that off does indeed help! Thanks a bunch :)
>> > > > > >
>> > > > > > You mention that bisecting this would be helpful - I can try that
>> > > > > > tomorrow; any idea when this was last working?
>> > > > > OK, so I tried to bisect this, but, erm, I couldn't find a working
>> > > > > revision to start from? I went all the way back to 4.10 (which is the
>> > > > > first version to include the device tree file for the Omnia), and even
>> > > > > on that, the wireless cards were failing to initialise with ASPM
>> > > > > enabled...
>> > > > I have no personal experience with this device; all I know is that the
>> > > > bugzilla suggests that it worked in v5.4, which isn't much help.
>> > > >
>> > > > Possibly the apparent regression was really a .config change, i.e.,
>> > > > CONFIG_PCIEASPM was disabled in the v5.4 kernel vtolkm@ tested and it
>> > > > "worked" but got enabled later and it started failing?
>> > > Yeah, I suspect so. The OpenWrt config disables CONFIG_PCIEASPM by
>> > > default and only turns it on for specific targets. So I guess that it's
>> > > most likely that this has never worked...
>> > >
>> > > > Maybe the debug patch below would be worth trying to see if it makes
>> > > > any difference? If it *does* help, try omitting the first hunk to see
>> > > > if we just need to apply the quirk_enable_clear_retrain_link() quirk.
>> > > Tried, doesn't help...
>> > >
>> > > -Toke
>> >
>> > Found this patch
>> >
>> > https://github.com/openwrt/openwrt/blob/7c0496f29bed87326f1bf591ca25ace82373cfc7/target/linux/mvebu/patches-5.4/405-PCI-aardvark-Improve-link-training.patch
>> >
>> > that mentions the Compex WLE900VX card, which reading the lspci verbose
>> > output from the bugtracker seems to be the device being troubled.
>>
>> Interesting. Indeed, the Compex WLE900VX card seems to have the
>> Qualcomm Atheros QCA9880 on it, and it looks like Toke's system has
>> the same device in it.
>>
>> The patch you mention (https://git.kernel.org/linus/43fc679ced18) is
>> for aardvark, so of course doesn't help mvebu.
>>
>> PCIe hardware is supposed to automatically negotiate the highest link
>> speed supported by both ends. But software *is* allowed to set an
>> upper limit (the Target Link Speed in Link Control 2). If we initiate
>> a retrain and the link doesn't come back up, I wonder if we should try
>> to help the hardware out by using Target Link Speed to limit to a
>> lower speed and attempting another retrain, something like this hacky
>> patch: (please collect the dmesg log if you try this)
>
> My experience with that WLE900VX card, aardvark driver and aspm code:
>
> Link training in GEN2 mode for this card succeeds only once after reset.
> Repeated link retraining fails, and it fails even when aardvark is
> reconfigured to GEN1 mode. Reset via PERST# signal is required to have
> working link training.
>
> What I did in aardvark driver: Set mode to GEN2, do link training. If
> it succeeds, read "negotiated link speed" from "Link Control Status Register"
> (for WLE900VX it is 0x1 - GEN1) and set it into aardvark. And then
> retrain link again (for WLE900VX now it would be at GEN1). After that
> card is stable and all future retraining (e.g. from aspm.c) also passes.
>
> If I do not change aardvark mode from GEN2 to GEN1 the second link
> training fails. And if I change mode to GEN1 after this failed link
> training then nothing happens; link training does not succeed.
>
> So just speculation now... In current setup initialization of card does
> one link training at GEN2. Then aspm.c is called which is doing second
> link retraining at GEN2. And if it fails then below patch issues a third
> link retraining at GEN1. If A38x/pci-mvebu has same problem as aardvark
> then second link retraining must be at GEN1 (not GEN2) to work around
> this issue.
>
> Bjorn, Toke: what about trying to hack aspm.c code to never do link
> retraining at GEN2 speed? And always force GEN1 speed prior link
> training?

Sounds like a plan. I poked around in aspm.c and must confess to being a
bit lost in the soup of registers ;) So if one of you can cook up a
patch, that would be most helpful!

-Toke
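
For illustration only (this is not a patch from the thread): the idea of
forcing GEN1 before a retrain comes down to writing the 2.5 GT/s code into
the Target Link Speed field of Link Control 2 and then setting the Retrain
Link bit in Link Control. The sketch below only does that bit arithmetic on
plain integers. The constants mirror the values in Linux's
include/uapi/linux/pci_regs.h; the helper names are hypothetical, and an
actual aspm.c change would presumably go through the kernel's
pcie_capability_*() accessors rather than anything like this.

```c
#include <assert.h>
#include <stdint.h>

/* Values as found in Linux's include/uapi/linux/pci_regs.h. */
#define PCI_EXP_LNKCTL_RL          0x0020  /* Link Control: Retrain Link */
#define PCI_EXP_LNKSTA_CLS         0x000f  /* Link Status: Current Link Speed */
#define PCI_EXP_LNKCTL2_TLS        0x000f  /* Link Control 2: Target Link Speed */
#define PCI_EXP_LNKCTL2_TLS_2_5GT  0x0001  /* 2.5 GT/s, i.e. GEN1 */

/* Hypothetical helper: speed code the link actually negotiated. */
static inline uint16_t negotiated_speed(uint16_t lnksta)
{
	return lnksta & PCI_EXP_LNKSTA_CLS;
}

/* Hypothetical helper: lnkctl2 with Target Link Speed replaced by speed. */
static inline uint16_t with_target_speed(uint16_t lnkctl2, uint16_t speed)
{
	/* The field is 4 bits wide and 0 is not a valid speed code. */
	assert(speed >= 1 && speed <= PCI_EXP_LNKCTL2_TLS);
	return (uint16_t)((lnkctl2 & ~PCI_EXP_LNKCTL2_TLS) | speed);
}

/* Hypothetical helper: lnkctl with the Retrain Link bit set. */
static inline uint16_t with_retrain(uint16_t lnkctl)
{
	return lnkctl | PCI_EXP_LNKCTL_RL;
}
```

Capping at whatever negotiated_speed() reports after the first training
would correspond to the aardvark approach described above; always passing
PCI_EXP_LNKCTL2_TLS_2_5GT would correspond to the "never retrain at GEN2"
suggestion.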