On Sat, Jan 13, 2024 at 1:37 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote: > > On Fri, Jan 12, 2024 at 01:14:42PM +0800, Kai-Heng Feng wrote: > > On Sat, Jan 6, 2024 at 5:19 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote: > > > > > > On Thu, Dec 21, 2023 at 11:21:47AM +0800, Kai-Heng Feng wrote: > > > > Spamming `lspci -vv` can still observe the replay timer timeout error > > > > even after commit 015c9cbcf0ad ("mmc: sdhci-pci-gli: GL9750: Mask the > > > > replay timer timeout of AER"), albeit with a lower reproduce rate. > > > > > > I'm not sure what this is telling me. By "spamming `lspci -vv`, do > > > you mean that if you run lspci continually, you still see Replay Timer > > > Timeout logged, e.g., > > > > > > CESta: ... Timeout+ > > > > Yes it's logged and the AER IRQ is raised. > > IIUC the AER IRQ is the important thing. > > Neither 015c9cbcf0ad nor this patch affects logging in > PCI_ERR_COR_STATUS, so the lspci output won't change and mentioning it > here doesn't add useful information. You are right. That's just a way to access config space to reproduce the issue. > > I'd suggest more specific wording than "spamming `lspci -vv`", e.g., > > 015c9cbcf0ad ("mmc: sdhci-pci-gli: GL9750: Mask the replay timer > timeout of AER") masks Replay Timer Timeout errors at the GL975x > Endpoint. When the Endpoint detects these errors, it still logs > them in its PCI_ERR_COR_STATUS, but masking prevents it from sending > ERR_COR messages upstream. > > The Downstream Port leading to a GL975x Endpoint is unaffected by > 015c9cbcf0ad. Previously, when that Port detected a Replay Timer > Timeout, it sent an ERR_COR message upstream, which eventually > caused an AER IRQ, which prevented the system from suspending. > > Mask Replay Timer Timeout errors at the Downstream Port. The errors > will still be logged in PCI_ERR_COR_STATUS, but no ERR_COR will be > sent. That's phrased much better then mine :) > > > > 015c9cbcf0ad uses hard-coded PCI_GLI_9750_CORRERR_MASK offset and > > > PCI_GLI_9750_CORRERR_MASK_REPLAY_TIMER_TIMEOUT value, which look like > > > they *could* be PCI_ERR_COR_MASK and PCI_ERR_COR_REP_TIMER, but > > > without the lspci output I can't tell for sure. If they are, it would > > > be nice to use the generic macros instead of defining new ones so it's > > > easier to analyze PCI_ERR_COR_MASK usage. > > > > > > If 015c9cbcf0ad is updating the generic PCI_ERR_COR_MASK, it should > > > only prevent sending ERR_COR. It should not affect the *logging* in > > > PCI_ERR_COR_STATUS (see PCIe r6.0, sec 6.2.3.2.2), so it shouldn't > > > affect the lspci output. > > > > PCI_GLI_9750_CORRERR_MASK is specific to GLI 975x devices, so it > > doesn't conform to generic PCI_ERR_COR_STATUS behavior. > > *Could* 015c9cbcf0ad have used the generic PCI_ERR_COR_MASK to > accomplish the same effect? Is there an advantage to using the > device-specific PCI_GLI_9750_CORRERR_MASK? > > If masking via PCI_ERR_COR_MASK would work, that would be much better > because the PCI core can see, manage, and make that visible, e.g., via > sysfs. The core doesn't do that today, but people are working on it. I don't think so. Please see below. > > > > If 015c9cbcf0ad is actually updating PCI_ERR_COR_MASK, it would be > > > nice to clean that up, too. And maybe PCI_ERR_COR_REP_TIMER should be > > > masked/restored at the same place for both the Downstream Port and the > > > Endpoint? > > > > Since PCI_ERR_COR_REP_TIMER is already masked before 015c9cbcf0ad, > > so I didn't think that's necessary. Do you think it should still be > > masked just to be safe? > > Did you mean "PCI_ERR_COR_REP_TIMER is already masked *by* > 015c9cbcf0ad"? No. The PCI_ERR_COR_REP_TIMER is masked with or without 015c9cbcf0ad. That means before 015c9cbcf0ad, Reply Timeout error was reported with PCI_ERR_COR_REP_TIMER masked. So using PCI_GLI_9750_CORRERR_MASK is necessary for the endpoint. > > If masking PCI_ERR_COR_REP_TIMER using the generic PCI_ERR_COR_MASK in > the GL975x would have the same effect as masking it with > PCI_GLI_9750_CORRERR_MASK, then I think you should *only* use the > generic PCI_ERR_COR_MASK. > > No need to do both if the generic one is sufficient. And I think both > should be done in the same place since they're basically solving the > same problem, just at both ends of the link. Do you mean only mask PCI_GLI_9750_CORRERR_MASK during suspend? That will not be ideal because accessing its config space (e.g. `lspci -vv`) will have many errors logged. Kai-Heng > > Bjorn