On Thursday 06 May 2021 17:31:12 Pali Rohár wrote: > Trying to start a new PIO transfer by writing value 0 in PIO_START register > when previous transfer has not yet completed (which is indicated by value 1 > in PIO_START) causes an External Abort on CPU, which results in kernel > panic: > > SError Interrupt on CPU0, code 0xbf000002 -- SError > Kernel panic - not syncing: Asynchronous SError Interrupt > > To prevent kernel panic, it is required to reject a new PIO transfer when > previous one has not finished yet. > > If previous PIO transfer is not finished yet, the kernel may issue a new > PIO request only if the previous PIO transfer timed out. > > In the past the root cause of this issue was incorrectly identified (as it > often happens during link retraining or after link down event) and special > hack was implemented in Trusted Firmware to catch all SError events in EL3, > to ignore errors with code 0xbf000002 and not forwarding any other errors > to kernel and instead throw panic from EL3 Trusted Firmware handler. > > Links to discussion and patches about this issue: > https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/commit/?id=3c7dcdac5c50 > https://lore.kernel.org/linux-pci/20190316161243.29517-1-repk@xxxxxxxxxxxx/ > https://lore.kernel.org/linux-pci/971be151d24312cc533989a64bd454b4@xxxxxxxxxxx/ > https://review.trustedfirmware.org/c/TF-A/trusted-firmware-a/+/1541 > > But the real cause was the fact that during link retraning or after link > down event the PIO transfer may take longer time, up to the 1.44s until it > times out. This increased probability that a new PIO transfer would be > issued by kernel while previous one has not finished yet. > > After applying this change into the kernel, it is possible to revert the > mentioned TF-A hack and SError events do not have to be caught in TF-A EL3. > > Signed-off-by: Pali Rohár <pali@xxxxxxxxxx> > Reviewed-by: Marek Behún <kabel@xxxxxxxxxx> > Cc: stable@xxxxxxxxxxxxxxx # 7fbcb5da811b ("PCI: aardvark: Don't rely on jiffies while holding spinlock") Hello! Could you please review at least this patch? It is fixing kernel panic and to prevent future kernel crashes I would really suggest to merge this one patch into 5.13 queue. > --- > drivers/pci/controller/pci-aardvark.c | 49 ++++++++++++++++++++++----- > 1 file changed, 40 insertions(+), 9 deletions(-) > > diff --git a/drivers/pci/controller/pci-aardvark.c b/drivers/pci/controller/pci-aardvark.c > index 051b48bd7985..e3f5e7ab7606 100644 > --- a/drivers/pci/controller/pci-aardvark.c > +++ b/drivers/pci/controller/pci-aardvark.c > @@ -514,7 +514,7 @@ static int advk_pcie_wait_pio(struct advk_pcie *pcie) > udelay(PIO_RETRY_DELAY); > } > > - dev_err(dev, "config read/write timed out\n"); > + dev_err(dev, "PIO read/write transfer time out\n"); > return -ETIMEDOUT; > } > > @@ -657,6 +657,35 @@ static bool advk_pcie_valid_device(struct advk_pcie *pcie, struct pci_bus *bus, > return true; > } > > +static bool advk_pcie_pio_is_running(struct advk_pcie *pcie) > +{ > + struct device *dev = &pcie->pdev->dev; > + > + /* > + * Trying to start a new PIO transfer when previous has not completed > + * cause External Abort on CPU which results in kernel panic: > + * > + * SError Interrupt on CPU0, code 0xbf000002 -- SError > + * Kernel panic - not syncing: Asynchronous SError Interrupt > + * > + * Functions advk_pcie_rd_conf() and advk_pcie_wr_conf() are protected > + * by raw_spin_lock_irqsave() at pci_lock_config() level to prevent > + * concurrent calls at the same time. But because PIO transfer may take > + * about 1.5s when link is down or card is disconnected, it means that > + * advk_pcie_wait_pio() does not always have to wait for completion. > + * > + * Some versions of ARM Trusted Firmware handles this External Abort at > + * EL3 level and mask it to prevent kernel panic. Relevant TF-A commit: > + * https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/commit/?id=3c7dcdac5c50 > + */ > + if (advk_readl(pcie, PIO_START)) { > + dev_err(dev, "Previous PIO read/write transfer is still running\n"); > + return true; > + } > + > + return false; > +} > + > static int advk_pcie_rd_conf(struct pci_bus *bus, u32 devfn, > int where, int size, u32 *val) > { > @@ -673,9 +702,10 @@ static int advk_pcie_rd_conf(struct pci_bus *bus, u32 devfn, > return pci_bridge_emul_conf_read(&pcie->bridge, where, > size, val); > > - /* Start PIO */ > - advk_writel(pcie, 0, PIO_START); > - advk_writel(pcie, 1, PIO_ISR); > + if (advk_pcie_pio_is_running(pcie)) { > + *val = 0xffffffff; > + return PCIBIOS_SET_FAILED; > + } > > /* Program the control register */ > reg = advk_readl(pcie, PIO_CTRL); > @@ -694,7 +724,8 @@ static int advk_pcie_rd_conf(struct pci_bus *bus, u32 devfn, > /* Program the data strobe */ > advk_writel(pcie, 0xf, PIO_WR_DATA_STRB); > > - /* Start the transfer */ > + /* Clear PIO DONE ISR and start the transfer */ > + advk_writel(pcie, 1, PIO_ISR); > advk_writel(pcie, 1, PIO_START); > > ret = advk_pcie_wait_pio(pcie); > @@ -734,9 +765,8 @@ static int advk_pcie_wr_conf(struct pci_bus *bus, u32 devfn, > if (where % size) > return PCIBIOS_SET_FAILED; > > - /* Start PIO */ > - advk_writel(pcie, 0, PIO_START); > - advk_writel(pcie, 1, PIO_ISR); > + if (advk_pcie_pio_is_running(pcie)) > + return PCIBIOS_SET_FAILED; > > /* Program the control register */ > reg = advk_readl(pcie, PIO_CTRL); > @@ -763,7 +793,8 @@ static int advk_pcie_wr_conf(struct pci_bus *bus, u32 devfn, > /* Program the data strobe */ > advk_writel(pcie, data_strobe, PIO_WR_DATA_STRB); > > - /* Start the transfer */ > + /* Clear PIO DONE ISR and start the transfer */ > + advk_writel(pcie, 1, PIO_ISR); > advk_writel(pcie, 1, PIO_START); > > ret = advk_pcie_wait_pio(pcie); > -- > 2.20.1 >