On 29/10/2020 20:30, Bjorn Helgaas wrote:
> On Thu, Oct 29, 2020 at 12:12:21PM +0100, Toke Høiland-Jørgensen wrote:
>> Pali Rohár <pali@xxxxxxxxxx> writes:
>>> I have been testing mainline kernel on Turris Omnia with two PCIe
>>> default cards (WLE200 and WLE900) and it worked fine. But I do not
>>> know if I had ASPM enabled or not.
>>>
>>> So it is working fine for you when CONFIG_PCIEASPM is disabled and
>>> whole issue is only when CONFIG_PCIEASPM is enabled?
>>
>> Yup, exactly. And I'm also currently testing with the default
>> WLE200/900 cards... I just tried sticking an MT76-based WiFi card
>> into the third PCI slot, and that doesn't come up either when I
>> enable PCIEASPM.
>
> Huh. So IIUC, the following cases all try to retrain the link and it
> fails to come up again:
>
>   - aardvark + WLE900VX (see commit 43fc679ced18)
>   - mvebu + WLE200
>   - mvebu + WLE900
>   - mvebu + MT76
>
> In all these cases, Linux was able to enumerate the NIC, which means
> the link was up when firmware handed it off.
>
> I think Linux decided the Common Clock Configuration was wrong, so it
> tried to fix it and retrain the link, and the link didn't come back
> up.
>
> I don't have "lspci -vv" output from all of them, but in vtolkm's
> case, the firmware handed off with:
>
>   00:02.0 Root Port to [bus 02]  SlotClk+ CommClk+
>   02:00.0 QCA986x/988x NIC       SlotClk+ CommClk-
>
> Per spec (PCIe r5, sec 7.5.3.7), SlotClk is HwInit and CommClk is RW
> and should power up as 0. If I'm reading the implementation note
> correctly, if SlotClk is set on both ends of the link, software
> should set CommClk, so the config above *does* look wrong, and
> CommClk+ on the Root Port suggests that firmware set it.
>
> I think both the aardvark and mvebu systems probably use U-Boot. I
> don't know U-Boot at all, but I don't see anything in it that touches
> Link Control.
>
> I'm curious what happens if you put one of these cards in a PC. If
> anybody tries it, please collect the "sudo lspci -vv" and dmesg
> output.
>
> We could quirk these NICs to avoid the retrain, but since aardvark
> and mvebu have no obvious connection and WLE200/WLE900 and MT76 have
> no obvious connection, I doubt there's a simple hardware defect that
> explains all these.
>
> Maybe we're doing something wrong in the retrain, but obviously the
> link came up in the first place. AFAIK the only thing we're changing
> is the CommClk setting, and that looks legitimate per spec.
>
> Another experiment: build kernel without CONFIG_PCIEASPM, set $ROOT
> and $NIC appropriately, and try the following:
>
>   # Set $ROOT and $NIC (update to match your system):
>   # ROOT=00:02.0
>   # NIC=02:00.0
>
>   # Dump the Root Port and NIC Link registers:
>   # setpci -s$ROOT CAP_EXP+0xc.l      # Link Capabilities
>   # setpci -s$ROOT CAP_EXP+0x10.w     # Link Control
>   # setpci -s$ROOT CAP_EXP+0x12.w     # Link Status
>   # setpci -s$NIC CAP_EXP+0xc.l       # Link Capabilities
>   # setpci -s$NIC CAP_EXP+0x10.w      # Link Control
>   # setpci -s$NIC CAP_EXP+0x12.w      # Link Status
>
>   # Retrain the link:
>   # setpci -s$ROOT CAP_EXP+0x10.w=0x0020   # Link Control Retrain Link
>   # sleep 1
>   # setpci -s$ROOT CAP_EXP+0x12.w     # Link Status
>   # setpci -s$NIC CAP_EXP+0x12.w      # Link Status
>
>   # Set CommClk+ and retrain the link:
>   # setpci -s$NIC CAP_EXP+0x10.w=0x0040    # Link Control Common Clock
>   # setpci -s$ROOT CAP_EXP+0x10.w=0x0040   # Link Control Common Clock
>   # setpci -s$ROOT CAP_EXP+0x10.w=0x0060   # Link Control RL + CC
>   # sleep 1
>   # setpci -s$ROOT CAP_EXP+0x12.w     # Link Status
>   # setpci -s$NIC CAP_EXP+0x12.w      # Link Status
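For anyone following along, the Link Control values written in the experiment above can be broken down with a small helper. This is only a sketch: decode_lnkctl is a hypothetical name, not part of pciutils, and it looks at just three fields of the PCIe Link Control register (ASPM Control in bits 1:0, Retrain Link in bit 5, Common Clock Configuration in bit 6).

```shell
#!/bin/sh
# decode_lnkctl: hypothetical helper, not part of pciutils.
# Takes a Link Control value in hex without the 0x prefix, which is
# how "setpci ... CAP_EXP+0x10.w" prints it.
decode_lnkctl() {
    _v=$(( 0x$1 ))
    _aspm=$((  _v       & 3 ))     # bits 1:0  ASPM Control
    _rl=$((   (_v >> 5) & 1 ))     # bit 5     Retrain Link
    _cc=$((   (_v >> 6) & 1 ))     # bit 6     Common Clock Configuration
    echo "aspm=$_aspm retrain=$_rl commclk=$_cc"
}

decode_lnkctl 0020   # aspm=0 retrain=1 commclk=0
decode_lnkctl 0040   # aspm=0 retrain=0 commclk=1
decode_lnkctl 0060   # aspm=0 retrain=1 commclk=1
```

Since setpci writes the whole word when no mask is given, a write of 0x0020 also leaves Common Clock Configuration clear, which may be worth keeping in mind when comparing the two retrain attempts.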
ROOT=00:02.0
NIC=02:00.0

setpci -s$ROOT CAP_EXP+0xc.l
0003ac12
setpci -s$ROOT CAP_EXP+0x10.w
0040
setpci -s$ROOT CAP_EXP+0x12.w
1011
setpci -s$NIC CAP_EXP+0xc.l
00036c11
setpci -s$NIC CAP_EXP+0x10.w
0000
setpci -s$NIC CAP_EXP+0x12.w
1011

setpci -s$ROOT CAP_EXP+0x10.w=0x0020
sleep 1
setpci -s$ROOT CAP_EXP+0x12.w
1011
setpci -s$NIC CAP_EXP+0x12.w
setpci: 0000:02:00.0: Instance #0 of Capability 0010 not found - there are no capabilities with that id.

setpci -s$NIC CAP_EXP+0x10.w=0x0040
setpci: 0000:02:00.0: Instance #0 of Capability 0010 not found - there are no capabilities with that id.
setpci -s$ROOT CAP_EXP+0x10.w=0x0040
setpci -s$ROOT CAP_EXP+0x10.w=0x0060
sleep 1
setpci -s$ROOT CAP_EXP+0x12.w
1811
setpci -s$NIC CAP_EXP+0x12.w
setpci: 0000:02:00.0: Instance #0 of Capability 0010 not found - there are no capabilities with that id.
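
To make the Link Status readings above easier to compare, here is a minimal sketch that splits a value into a few fields. decode_lnksta is a hypothetical helper name, and it only covers Current Link Speed (bits 3:0), Negotiated Link Width (bits 9:4), Link Training (bit 11) and Slot Clock Configuration (bit 12) of the PCIe Link Status register.

```shell
#!/bin/sh
# decode_lnksta: hypothetical helper, not part of pciutils.
# Takes a Link Status value in hex without the 0x prefix, which is
# how "setpci ... CAP_EXP+0x12.w" prints it.
decode_lnksta() {
    _v=$(( 0x$1 ))
    _speed=$((    _v        & 0xf  ))   # bits 3:0  Current Link Speed
    _width=$((   (_v >> 4)  & 0x3f ))   # bits 9:4  Negotiated Link Width
    _train=$((   (_v >> 11) & 1    ))   # bit 11    Link Training
    _slotclk=$(( (_v >> 12) & 1    ))   # bit 12    Slot Clock Configuration
    echo "speed=$_speed width=x$_width training=$_train slotclk=$_slotclk"
}

decode_lnksta 1011   # speed=1 width=x1 training=0 slotclk=1
decode_lnksta 1811   # speed=1 width=x1 training=1 slotclk=1
```

Read this way, the only difference between the 1011 and 1811 readings on the Root Port is the Link Training bit, which is set in the last reading taken one second after the RL + CC write.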