On Mon, Jun 12, 2023 at 09:56:27AM +0530, Siddharth Vadapalli wrote: > > > On 09/06/23 23:09, Bjorn Helgaas wrote: > > On Wed, Jun 07, 2023 at 02:44:27PM +0530, Siddharth Vadapalli wrote: > >> The Link Retraining process is initiated to account for the Gen2 defect in > >> the Cadence PCIe controller in J721E SoC. The errata corresponding to this > >> is i2085, documented at: > >> https://www.ti.com/lit/er/sprz455c/sprz455c.pdf > >> > >> The existing workaround implemented for the errata waits for the Data Link > >> initialization to complete and assumes that the link retraining process > >> at the Physical Layer has completed. However, it is possible that the > >> Physical Layer training might be ongoing as indicated by the > >> PCI_EXP_LNKSTA_LT bit in the PCI_EXP_LNKSTA register. > >> > >> Fix the existing workaround, to ensure that the Physical Layer training > >> has also completed, in addition to the Data Link initialization. > >> > >> Fixes: 4740b969aaf5 ("PCI: cadence: Retrain Link to work around Gen2 training defect") > >> Signed-off-by: Siddharth Vadapalli <s-vadapalli@xxxxxx> > >> Reviewed-by: Vignesh Raghavendra <vigneshr@xxxxxx> > >> --- > >> > >> Hello, > >> > >> This patch is based on linux-next tagged next-20230606. > >> > >> v2: > >> https://lore.kernel.org/r/20230315070800.1615527-1-s-vadapalli@xxxxxx/ > >> Changes since v2: > >> - Merge the cdns_pcie_host_training_complete() function with the > >> cdns_pcie_host_wait_for_link() function, as suggested by Bjorn > >> for the v2 patch. > >> - Add dev_err() to notify when Link Training fails, since this is a > >> fatal error and proceeding from this point will almost always crash > >> the kernel. > >> > >> v1: > >> https://lore.kernel.org/r/20230102075656.260333-1-s-vadapalli@xxxxxx/ > >> Changes since v1: > >> - Collect Reviewed-by tag from Vignesh Raghavendra. > >> - Rebase on next-20230315. > >> > >> Regards, > >> Siddharth. > >> > >> .../controller/cadence/pcie-cadence-host.c | 20 +++++++++++++++++++ > >> 1 file changed, 20 insertions(+) > >> > >> diff --git a/drivers/pci/controller/cadence/pcie-cadence-host.c b/drivers/pci/controller/cadence/pcie-cadence-host.c > >> index 940c7dd701d6..70a5f581ff4f 100644 > >> --- a/drivers/pci/controller/cadence/pcie-cadence-host.c > >> +++ b/drivers/pci/controller/cadence/pcie-cadence-host.c > >> @@ -12,6 +12,8 @@ > >> > >> #include "pcie-cadence.h" > >> > >> +#define LINK_RETRAIN_TIMEOUT HZ > >> + > >> static u64 bar_max_size[] = { > >> [RP_BAR0] = _ULL(128 * SZ_2G), > >> [RP_BAR1] = SZ_2G, > >> @@ -80,8 +82,26 @@ static struct pci_ops cdns_pcie_host_ops = { > >> static int cdns_pcie_host_wait_for_link(struct cdns_pcie *pcie) > >> { > >> struct device *dev = pcie->dev; > >> + unsigned long end_jiffies; > >> + u16 link_status; > >> int retries; > >> > >> + /* Wait for link training to complete */ > >> + end_jiffies = jiffies + LINK_RETRAIN_TIMEOUT; > >> + do { > >> + link_status = cdns_pcie_rp_readw(pcie, CDNS_PCIE_RP_CAP_OFFSET + PCI_EXP_LNKSTA); > >> + if (!(link_status & PCI_EXP_LNKSTA_LT)) > >> + break; > >> + usleep_range(0, 1000); > >> + } while (time_before(jiffies, end_jiffies)); > >> + > >> + if (!(link_status & PCI_EXP_LNKSTA_LT)) { > >> + dev_info(dev, "Link training complete\n"); > >> + } else { > >> + dev_err(dev, "Fatal! Link training incomplete\n"); > >> + return -ETIMEDOUT; > >> + } > > > > Can I have a brown paper bag, please? I totally blew it here, and I'm > > sorry. > > > > You took my advice by combining this with the existing > > cdns_pcie_host_wait_for_link(), but I think my advice was poor because > > (a) now this additional wait is not clearly connected with the > > erratum, and (b) it affects devices that don't have the erratum. > > > > IIUC, this is all part of a workaround for the i2085 erratum. The > > original workaround, 4740b969aaf5 ("PCI: cadence: Retrain Link to work > > around Gen2 training defect"), added this: > > > > if (!ret && rc->quirk_retrain_flag) > > ret = cdns_pcie_retrain(pcie); > > > > I think the wait for link train to complete should also be in > > cdns_pcie_retrain() so it's clearly connected with the quirk, which > > also means we'd only do the wait for devices with the erratum. > > > > Which is EXACTLY what your first patch did, and I missed it. I am > > very sorry. I guess maybe I thought cdns_pcie_retrain() was a > > general-purpose thing, but in fact it's only used for this quirk. > > With the current approach implemented in this patch, I could do the following: > In the cdns_pcie_host_wait_for_link() function, I obtain the reference to the > struct cdns_pcie_rc *rc, using: > struct cdns_pcie_rc *rc = container_of(pcie, struct cdns_pcie_rc, pcie); > followed by checking if the quirk "quirk_retrain_flag" is set, before proceeding > with the Link Training check added by this patch. With this, only the > controllers with the quirk will check for the Link Training completion before > proceeding. However, the difference with this new approach compared to the > approach in the v2 patch is that in this new approach, even in the Link Training > Phase, the Link Training check is performed for the controllers with the quirk, > unlike the v2 patch where the Link Training check was performed only during the > Link Retraining Phase through the cdns_pcie_retrain() function. > > Also, based on Mani's suggestion, I have measured the latency introduced by the > Link Training check for both quirky and non-quirky controllers at: > https://lore.kernel.org/r/a63fc8b0-581b-897f-cac6-cb0a0e82c63e@xxxxxx/ > If the latency is acceptable, then the current implementation in this v3 patch > could be fine too. > > Kindly let me know which approach among the following seems to be the best one: > 1. The approach implemented in v2 patch (I will make minor changes to the patch > to print out the "Fatal" error, so that users will be informed of the cause of > the crash, followed by posting a v4 patch with this change). > 2. The current implementation in the v3 patch with a check added to see if the > controller has the quirk_retrain_flag set, before proceeding with the Link > Training check. > 3. The current implementation in the v3 patch as is, without any modification, > if the latency introduced is not a concern and the sanity check for Link > Training completion for non-quirky controllers appears acceptable. The point is, you stated it yourself that the non-quirky path is broken too in its *current* form, I don't think there is any other option on the table other than (3) (unless we want to rely on probe time timing to hide the issue; that to me it is not even considerable as an option). Lorenzo