On Tue, Mar 26, 2024 at 3:02 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote: > > On Mon, Mar 25, 2024 at 10:02:27AM +0800, Kai-Heng Feng wrote: > > On Sat, Mar 23, 2024 at 12:43 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote: > > > On Thu, Mar 21, 2024 at 06:05:33PM +0800, Kai-Heng Feng wrote: > > > > On Sat, Jan 20, 2024 at 6:41 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote: > > > > > On Thu, Jan 18, 2024 at 02:40:50PM +0800, Kai-Heng Feng wrote: > > > > > > On Sat, Jan 13, 2024 at 1:37 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote: > > > > > > > On Fri, Jan 12, 2024 at 01:14:42PM +0800, Kai-Heng Feng wrote: > > > > > > > > On Sat, Jan 6, 2024 at 5:19 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote: > > > > > > > > > On Thu, Dec 21, 2023 at 11:21:47AM +0800, Kai-Heng Feng wrote: > > > > > > > > > > Spamming `lspci -vv` can still observe the replay timer timeout error > > > > > > > > > > even after commit 015c9cbcf0ad ("mmc: sdhci-pci-gli: GL9750: Mask the > > > > > > > > > > replay timer timeout of AER"), albeit with a lower reproduce rate. > > > > > > > > > > > > > > > > > > I'm not sure what this is telling me. By "spamming `lspci -vv`, do > > > > > > > > > you mean that if you run lspci continually, you still see Replay Timer > > > > > > > > > Timeout logged, e.g., > > > > > > > > > > > > > > > > > > CESta: ... Timeout+ > > > > > > > > > > > > > > > > Yes it's logged and the AER IRQ is raised. > > > > > > > > > > > > > > IIUC the AER IRQ is the important thing. > > > > > > > > > > > > > > Neither 015c9cbcf0ad nor this patch affects logging in > > > > > > > PCI_ERR_COR_STATUS, so the lspci output won't change and mentioning it > > > > > > > here doesn't add useful information. > > > > > > > > > > > > You are right. That's just a way to access config space to reproduce > > > > > > the issue. > > > > > > > > > > Oh, I think I completely misunderstood you! I thought you were saying > > > > > that suspending the device caused the PCI_ERR_COR_REP_TIMER error, and > > > > > you happened to see that it was logged when you ran lspci. > > > > > > > > Both running lspci and suspending the device can observe the error, > > > > because both are accessing the config space. > > > > > > > > > But I guess you mean that running lspci actually *causes* the error? > > > > > I.e., lspci does a config access while we're suspending the device > > > > > causes the error, and the config access itself causes the error, which > > > > > causes the ERR_COR message and ultimately the AER interrupt, and that > > > > > interrupt prevents the system suspend. > > > > > > > > My point was that any kind of PCI config access can cause the error. > > > > Using lspci is just make the error more easier to reproduce. > > > > > > > > > If that's the case, I wonder if this is a generic problem that could > > > > > happen with *any* device, not just GL975x. > > > > > > > > For now, it's just GL975x. > > > > > > > > > What power state do we put the GL975x in during system suspend? > > > > > D3hot? D3cold? Is there anything that prevents config access while > > > > > we suspend it? > > > > > > > > The target device state is D3hot. > > > > However, the issue happens when the devices is in D0, when the PCI > > > > core is saving the device's config space. > > > > > > > > So I think the issue isn't related to the device state. > > > > > > > > > We do have dev->block_cfg_access, and there's a comment that says > > > > > "we're required to prevent config accesses during D-state > > > > > transitions," but I don't see it being used during D-state > > > > > transitions. > > > > > > > > Yes, there isn't any D-state change happens here. > > > > > > So the timeout happens sometimes on any config accesses to the device, > > > no matter what the power state is? > > > > Yes. > > > > > If that's the case, why do the > > > masking in the suspend/resume callbacks? > > > > Because there's no functional impact when the error happens, other > > than suspend/resume. > > Oh, I think I see. Is this accurate? > > Due to a hardware defect in GL975x, config accesses when ASPM is > enabled frequently cause Replay Timer Timeouts in the Port leading > to the device. > > These are Correctable Errors, so the Downstream Port logs it in its > PCI_ERR_COR_STATUS and, when the error is not masked, sends an > ERR_COR message upstream. The message terminates at a Root Port, > which may generate an AER interrupt so the OS can log it. > > The Correctable Error logging is an annoyance but normally not a > major issue. But when the AER interrupt happens during suspend, it > can prevent the system from suspending. That's totally the case here. This brings up another different but related topic - should the port driver disable AER/DPC IRQ during suspend? We've discussed this many times, I still think that's the right approach to "quiesce" many unexpected errors during system state transition. > > > > If it's not related to a power state change, it sounds like something > > > that should be a quirk or done at probe time. > > > > Sure, I'll change that to be done at probe time. > > In general, we want to log Correctable Errors because they give an > indication of link integrity. But I'm guessing that since this is > related to a GL975x defect, the errors occur pretty frequently and on > all systems, so they aren't an indication of poor link quality and > they only break suspend, annoy users, and cause problem reports. > > Masking them at suspend time would avoid the suspend/resume problem, > but we would still have the annoying logging and get the problem > reports. > > If this is all true, I think masking via a quirk is probably the right > thing. That way we won't get reports that "lspci causes errors" even > when the sdhci driver is not loaded. > > I think we should log a hint in dmesg that we're masking > PCI_ERR_COR_REP_TIMER because the error will still be logged in the > PCI_ERR_COR_STATUS register, and that will be visible via lspci, and a > dmesg hint will save debugging time when people report that. Sure. Where do you think it's a better place to implement the quirk? I Assume PCI quirk is a better place than driver's probe routine? Kai-Heng > > Bjorn