On Fri, Nov 04, 2022 at 12:41:11AM +0100, Pali Rohár wrote:
> On Thursday 03 November 2022 18:13:35 Bjorn Helgaas wrote:
> > [+cc Pali]
> >
> > On Sat, Sep 17, 2022 at 01:03:38PM +0100, Maciej W. Rozycki wrote:
> > > Attempt to handle cases such as with a downstream port of the ASMedia
> > > ASM2824 PCIe switch where link training never completes and the link
> > > continues switching between speeds indefinitely with the data link layer
> > > never reaching the active state.
> > >
> > > It has been observed with a downstream port of the ASMedia ASM2824 Gen 3
> > > switch wired to the upstream port of the Pericom PI7C9X2G304 Gen 2
> > > switch, using a Delock Riser Card PCI Express x1 > 2 x PCIe x1 device,
> > > P/N 41433, wired to a SiFive HiFive Unmatched board. In this setup the
> > > switches are supposed to negotiate the link speed of preferably 5.0GT/s,
> > > falling back to 2.5GT/s.
> > >
> > > Instead the link continues oscillating between the two speeds, at the
> > > rate of 34-35 times per second, with link training reported repeatedly
> > > active ~84% of the time. Forcibly limiting the target link speed to
> > > 2.5GT/s with the upstream ASM2824 device however makes the two switches
> > > communicate correctly. Removing the speed restriction afterwards makes
> > > the two devices switch to 5.0GT/s then.
> > >
> > > Make use of these observations then and detect the inability to train
> > > the link, by checking for the Data Link Layer Link Active status bit
> > > being off while the Link Bandwidth Management Status indicating that
> > > hardware has changed the link speed or width in an attempt to correct
> > > unreliable link operation.
> > >
> > > Restrict the speed to 2.5GT/s then with the Target Link Speed field,
> > > request a retrain and wait 200ms for the data link to go up. If this
> > > turns out successful, then lift the restriction, letting the devices
> > > negotiate a higher speed.
> > >
> > > Also check for a 2.5GT/s speed restriction the firmware may have already
> > > arranged and lift it too with ports of devices known to continue working
> > > afterwards, currently the ASM2824 only, that already report their data
> > > link being up.
> >
> > This quirk is run at boot-time and resume-time. What happens after a
> > Secondary Bus Reset, as is done by pci_reset_secondary_bus()?
>
> Flipping SBR bit can be done on any PCI-to-PCI bridge device and in this
> topology there are following: PCIe Root Port, ASMedia PCIe Switch
> Upstream Port, ASMedia PCIe Switch Downstream Port, Pericom PCIe Switch
> Upstream Port, Pericom PCIe Switch Downstream Port.
> (Maciej, I hope that this is whole topology and there is not some other
> device of PCI-to-PCI bridge type in your setup; please correct me)
>
> Bjorn, to make it clear, on which device you mean to issue secondary bus
> reset?

IIUC, the problem is observed on the link between the ASM2824
downstream port and the PI7C9X2G304 upstream port, so my question is
about asserting SBR on the ASM2824 downstream port. I think that
should cause the link between ASM2824 and PI7C9X2G304 to go down and
back up.

Thanks for the question; I didn't notice before that this quirk
applies to *all* devices. I'm a little queasy about trying to fix
problems we have not observed. In this case, I think the hardware is
*supposed* to establish a link at the highest supported speed
automatically. If we need to work around a hardware bug, that's fine,
but I'm not sure I want to blindly try to help things along.

> Because I would not be surprised if different things happen when issuing
> bus reset on different parts of that topology.
>
> > PCIe r6.0, sec 7.5.1.3.13, says "setting Secondary Bus Reset triggers
> > a hot reset on the corresponding PCI Express Port".
> > Sec 4.2.7 says LinkUp is 0 in the LTSSM Hot Reset state, and the Hot
> > Reset state leads to Detect, so it looks like this reset would cause
> > the link to go down and come back up.
> >
> > Can you tell if that's what happens? Does the link negotiation fail
> > then, too?
> >
> > If it does fail then, I don't know how hard we need to work to fix it.
> > Maybe we just accept it? Or maybe we need a "quirk-after-reset" phase
> > or something?
> >
> > Bjorn
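
For reference, the detection condition and the speed restriction
described in the quoted patch description can be sketched roughly as
below. This is only a standalone illustration, not the patch itself:
the two helper names are hypothetical, and the bit values are assumed
to match the Link Status and Link Control 2 definitions in Linux's
include/uapi/linux/pci_regs.h. The retrain request and the 200ms wait
are left out, since they need real config space accesses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Register bits; values assumed to match Linux's pci_regs.h. */
#define PCI_EXP_LNKSTA_DLLLA      0x2000 /* Data Link Layer Link Active */
#define PCI_EXP_LNKSTA_LBMS       0x4000 /* Link Bandwidth Management Status */
#define PCI_EXP_LNKCTL2_TLS       0x000f /* Target Link Speed field */
#define PCI_EXP_LNKCTL2_TLS_2_5GT 0x0001 /* Target Link Speed 2.5GT/s */

/*
 * Hypothetical helper: true when hardware has changed the link speed or
 * width (LBMS set) while the data link is still not active (DLLLA clear).
 */
bool link_training_failed(uint16_t lnksta)
{
	return (lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
	       PCI_EXP_LNKSTA_LBMS;
}

/*
 * Hypothetical helper: return a Link Control 2 value with the Target Link
 * Speed field replaced by tls and all other bits preserved.
 */
uint16_t lnkctl2_with_tls(uint16_t lnkctl2, uint16_t tls)
{
	assert((tls & ~PCI_EXP_LNKCTL2_TLS) == 0); /* must fit in the field */
	return (uint16_t)((lnkctl2 & ~PCI_EXP_LNKCTL2_TLS) | tls);
}
```

A caller would read Link Status, and if link_training_failed() holds,
write back Link Control 2 with PCI_EXP_LNKCTL2_TLS_2_5GT before
requesting a retrain. Whether the same check would be reached after a
Secondary Bus Reset is exactly the open question above.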