On Sat, Jan 6, 2024 at 5:19 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote: > > On Thu, Dec 21, 2023 at 11:21:47AM +0800, Kai-Heng Feng wrote: > > Spamming `lspci -vv` can still observe the replay timer timeout error > > even after commit 015c9cbcf0ad ("mmc: sdhci-pci-gli: GL9750: Mask the > > replay timer timeout of AER"), albeit with a lower reproduce rate. > > I'm not sure what this is telling me. By "spamming `lspci -vv`, do > you mean that if you run lspci continually, you still see Replay Timer > Timeout logged, e.g., > > CESta: ... Timeout+ Yes it's logged and the AER IRQ is raised. > > 015c9cbcf0ad uses hard-coded PCI_GLI_9750_CORRERR_MASK offset and > PCI_GLI_9750_CORRERR_MASK_REPLAY_TIMER_TIMEOUT value, which look like > they *could* be PCI_ERR_COR_MASK and PCI_ERR_COR_REP_TIMER, but > without the lspci output I can't tell for sure. If they are, it would > be nice to use the generic macros instead of defining new ones so it's > easier to analyze PCI_ERR_COR_MASK usage. > > If 015c9cbcf0ad is updating the generic PCI_ERR_COR_MASK, it should > only prevent sending ERR_COR. It should not affect the *logging* in > PCI_ERR_COR_STATUS (see PCIe r6.0, sec 6.2.3.2.2), so it shouldn't > affect the lspci output. PCI_GLI_9750_CORRERR_MASK is specific to GLI 975x devices, so it doesn't conform to generic PCI_ERR_COR_STATUS behavior. The Timeout is masked with or without commit 015c9cbcf0ad: CEMsk: ... Timeout+ > > > Such AER interrupt can still prevent the system from suspending, so let > > root port mask and unmask replay timer timeout during suspend and > > resume, respectively. > > 015c9cbcf0ad looks like it masks PCI_ERR_COR_REP_TIMER in the gl975x > Endpoint, while this patch masks it in the upstream bridge (which > might be either a Root Port or a Switch Downstream Port, so the > subject and this sentence are not quite right). OK, will change it to upstream bridge in next revision. > > 015c9cbcf0ad says it is related to a hardware defect, and maybe this > patch is also (mention it if so). Both patches can prevent ERR_COR > messages and the eventual AER interrupt, depending on whether the > Downstream Port or the Endpoint detects the Replay Timer Timeout. > Maybe this should have a Fixes: tag for 015c9cbcf0ad to try to keep > these together? Sure. This patch is intend to cover more ground based on 015c9cbcf0ad. > > If 015c9cbcf0ad is actually updating PCI_ERR_COR_MASK, it would be > nice to clean that up, too. And maybe PCI_ERR_COR_REP_TIMER should be > masked/restored at the same place for both the Downstream Port and the > Endpoint? Since PCI_ERR_COR_REP_TIMER is already masked before 015c9cbcf0ad, so I didn't think that's necessary. Do you think it should still be masked just to be safe? > > > +#ifdef CONFIG_PCIEAER > > +static void mask_replay_timer_timeout(struct pci_dev *pdev) > > +{ > > + struct pci_dev *parent = pci_upstream_bridge(pdev); > > + u32 val; > > + > > + if (!parent || !parent->aer_cap) > > + return; > > + > > + pci_read_config_dword(parent, parent->aer_cap + PCI_ERR_COR_MASK, &val); > > + val |= PCI_ERR_COR_REP_TIMER; > > + pci_write_config_dword(parent, parent->aer_cap + PCI_ERR_COR_MASK, val); > > +} > > + > > +static void unmask_replay_timer_timeout(struct pci_dev *pdev) > > +{ > > + struct pci_dev *parent = pci_upstream_bridge(pdev); > > + u32 val; > > + > > + if (!parent || !parent->aer_cap) > > + return; > > + > > + pci_read_config_dword(pdev, parent->aer_cap + PCI_ERR_COR_MASK, &val); > > + val &= ~PCI_ERR_COR_REP_TIMER; > > I think I would save the previous PCI_ERR_COR_REP_TIMER value and > restore it here, so it is preserved if there is ever a generic > interface via sysfs or similar to manage correctable error masking. Makes sense, will do in next revision. Kai-Heng > > > + pci_write_config_dword(pdev, parent->aer_cap + PCI_ERR_COR_MASK, val); > > +}