On Thu, Dec 21, 2023 at 11:21:47AM +0800, Kai-Heng Feng wrote: > Spamming `lspci -vv` can still observe the replay timer timeout error > even after commit 015c9cbcf0ad ("mmc: sdhci-pci-gli: GL9750: Mask the > replay timer timeout of AER"), albeit with a lower reproduce rate. I'm not sure what this is telling me. By "spamming `lspci -vv`, do you mean that if you run lspci continually, you still see Replay Timer Timeout logged, e.g., CESta: ... Timeout+ 015c9cbcf0ad uses hard-coded PCI_GLI_9750_CORRERR_MASK offset and PCI_GLI_9750_CORRERR_MASK_REPLAY_TIMER_TIMEOUT value, which look like they *could* be PCI_ERR_COR_MASK and PCI_ERR_COR_REP_TIMER, but without the lspci output I can't tell for sure. If they are, it would be nice to use the generic macros instead of defining new ones so it's easier to analyze PCI_ERR_COR_MASK usage. If 015c9cbcf0ad is updating the generic PCI_ERR_COR_MASK, it should only prevent sending ERR_COR. It should not affect the *logging* in PCI_ERR_COR_STATUS (see PCIe r6.0, sec 6.2.3.2.2), so it shouldn't affect the lspci output. > Such AER interrupt can still prevent the system from suspending, so let > root port mask and unmask replay timer timeout during suspend and > resume, respectively. 015c9cbcf0ad looks like it masks PCI_ERR_COR_REP_TIMER in the gl975x Endpoint, while this patch masks it in the upstream bridge (which might be either a Root Port or a Switch Downstream Port, so the subject and this sentence are not quite right). 015c9cbcf0ad says it is related to a hardware defect, and maybe this patch is also (mention it if so). Both patches can prevent ERR_COR messages and the eventual AER interrupt, depending on whether the Downstream Port or the Endpoint detects the Replay Timer Timeout. Maybe this should have a Fixes: tag for 015c9cbcf0ad to try to keep these together? If 015c9cbcf0ad is actually updating PCI_ERR_COR_MASK, it would be nice to clean that up, too. And maybe PCI_ERR_COR_REP_TIMER should be masked/restored at the same place for both the Downstream Port and the Endpoint? > +#ifdef CONFIG_PCIEAER > +static void mask_replay_timer_timeout(struct pci_dev *pdev) > +{ > + struct pci_dev *parent = pci_upstream_bridge(pdev); > + u32 val; > + > + if (!parent || !parent->aer_cap) > + return; > + > + pci_read_config_dword(parent, parent->aer_cap + PCI_ERR_COR_MASK, &val); > + val |= PCI_ERR_COR_REP_TIMER; > + pci_write_config_dword(parent, parent->aer_cap + PCI_ERR_COR_MASK, val); > +} > + > +static void unmask_replay_timer_timeout(struct pci_dev *pdev) > +{ > + struct pci_dev *parent = pci_upstream_bridge(pdev); > + u32 val; > + > + if (!parent || !parent->aer_cap) > + return; > + > + pci_read_config_dword(pdev, parent->aer_cap + PCI_ERR_COR_MASK, &val); > + val &= ~PCI_ERR_COR_REP_TIMER; I think I would save the previous PCI_ERR_COR_REP_TIMER value and restore it here, so it is preserved if there is ever a generic interface via sysfs or similar to manage correctable error masking. > + pci_write_config_dword(pdev, parent->aer_cap + PCI_ERR_COR_MASK, val); > +}