On Fri, 9 Aug 2024, Maciej W. Rozycki wrote:

> When `pcie_failed_link_retrain' has failed to retrain the link by hand
> it leaves the link speed restricted to 2.5GT/s, which will then affect
> any device that has been plugged in later on, which may not suffer from
> the problem that caused the speed restriction to have been attempted.
> Consequently such a downstream device will suffer from an unnecessary
> communication throughput limitation and therefore performance loss.
>
> Remove the speed restriction then and revert the Link Control 2 register
> to its original state if link retraining with the speed restriction in
> place has failed.  Retrain the link again afterwards to remove any
> residual state, ignoring the result as it's supposed to fail anyway.
>
> Fixes: a89c82249c37 ("PCI: Work around PCIe link training failures")
> Reported-by: Matthew W Carlis <mattc@xxxxxxxxxxxxxxx>
> Link: https://lore.kernel.org/r/20240806000659.30859-1-mattc@xxxxxxxxxxxxxxx/
> Link: https://lore.kernel.org/r/20240722193407.23255-1-mattc@xxxxxxxxxxxxxxx/
> Signed-off-by: Maciej W. Rozycki <macro@xxxxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx # v6.5+
> ---
> New change in v2.
> ---
>  drivers/pci/quirks.c |   11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> linux-pcie-failed-link-retrain-fail-unclamp.diff
> Index: linux-macro/drivers/pci/quirks.c
> ===================================================================
> --- linux-macro.orig/drivers/pci/quirks.c
> +++ linux-macro/drivers/pci/quirks.c
> @@ -66,7 +66,7 @@
>   * apply this erratum workaround to any downstream ports as long as they
>   * support Link Active reporting and have the Link Control 2 register.
>   * Restrict the speed to 2.5GT/s then with the Target Link Speed field,
> - * request a retrain and wait 200ms for the data link to go up.
> + * request a retrain and check the result.
>   *
>   * If this turns out successful and we know by the Vendor:Device ID it is
>   * safe to do so, then lift the restriction, letting the devices negotiate
> @@ -74,6 +74,10 @@
>   * firmware may have already arranged and lift it with ports that already
>   * report their data link being up.
>   *
> + * Otherwise revert the speed to the original setting and request a retrain
> + * again to remove any residual state, ignoring the result as it's supposed
> + * to fail anyway.
> + *
>   * Return TRUE if the link has been successfully retrained, otherwise FALSE.
>   */
>  bool pcie_failed_link_retrain(struct pci_dev *dev)
> @@ -92,6 +96,8 @@ bool pcie_failed_link_retrain(struct pci
>  	pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
>  	if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
>  	    PCI_EXP_LNKSTA_LBMS) {
> +		u16 oldlnkctl2 = lnkctl2;
> +
>  		pci_info(dev, "broken device, retraining non-functional downstream link at 2.5GT/s\n");
>
>  		lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
> @@ -100,6 +106,9 @@ bool pcie_failed_link_retrain(struct pci
>
>  		if (pcie_retrain_link(dev, false)) {
>  			pci_info(dev, "retraining failed\n");
> +			pcie_capability_write_word(dev, PCI_EXP_LNKCTL2,
> +						   oldlnkctl2);
> +			pcie_retrain_link(dev, false);

Hi again all,

While rebasing the bandwidth controller patches, I revisited this line
and realized that passing false for use_lt is not optimal here.

It would definitely seem better to use LT (true) in this case because it
likely results in a much shorter wait. In hotplug cases w/o a peer device,
waiting on DLLLA will just make the wait last until the timeout, whereas
waiting on LT would short-circuit the training almost right away, I think
(mostly guessing, with limited knowledge about the LTSSM).

We are no longer even expecting the link to come up at this point, so
waiting on DLLLA seems illogical.

Do you agree?

-- 
 i.
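
For illustration only, the difference between the two wait conditions can
be modelled in a few lines of user-space C. This is a minimal sketch, not
kernel code: struct sim_link, sim_read_lnksta(), sim_wait() and
SKETCH_TIMEOUT_MS are hypothetical names made up for this example, and the
assumption that LT clears within a few milliseconds on an empty slot simply
mirrors the guess above and has not been measured. Only the two Link Status
bit values are taken from include/uapi/linux/pci_regs.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Link Status register bits, values as in include/uapi/linux/pci_regs.h. */
#define PCI_EXP_LNKSTA_LT	0x0800	/* Link Training in progress */
#define PCI_EXP_LNKSTA_DLLLA	0x2000	/* Data Link Layer Link Active */

/* Hypothetical timeout for this sketch, in milliseconds. */
#define SKETCH_TIMEOUT_MS	1000

/* Hypothetical model of a downstream port, not a kernel structure. */
struct sim_link {
	unsigned int lt_clear_ms;	/* LT stays set until this time */
	bool has_peer;			/* false models an empty hotplug slot */
	unsigned int dllla_set_ms;	/* DLLLA sets at this time if has_peer */
};

/* Return the simulated Link Status value at time now_ms. */
static uint16_t sim_read_lnksta(const struct sim_link *l, unsigned int now_ms)
{
	uint16_t sta = 0;

	if (now_ms < l->lt_clear_ms)
		sta |= PCI_EXP_LNKSTA_LT;
	if (l->has_peer && now_ms >= l->dllla_set_ms)
		sta |= PCI_EXP_LNKSTA_DLLLA;
	return sta;
}

/*
 * Poll once per simulated millisecond until the chosen condition holds:
 * use_lt == true waits for LT to clear, use_lt == false waits for DLLLA
 * to be set.  Returns the time waited and reports a timeout via *timed_out.
 */
static unsigned int sim_wait(const struct sim_link *l, bool use_lt,
			     bool *timed_out)
{
	unsigned int now;

	for (now = 0; now < SKETCH_TIMEOUT_MS; now++) {
		uint16_t sta = sim_read_lnksta(l, now);
		bool done = use_lt ? !(sta & PCI_EXP_LNKSTA_LT)
				   : (sta & PCI_EXP_LNKSTA_DLLLA) != 0;

		if (done) {
			*timed_out = false;
			return now;
		}
	}
	*timed_out = true;
	return SKETCH_TIMEOUT_MS;
}
```

With an empty slot modelled as { .lt_clear_ms = 5, .has_peer = false },
sim_wait() with use_lt set returns as soon as LT clears, while with use_lt
clear it runs for the full SKETCH_TIMEOUT_MS and reports a timeout, which is
the behaviour the paragraph above argues against for the final retrain.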