On Tue, Mar 26, 2024 at 09:52:28AM +0800, Kai-Heng Feng wrote: > On Tue, Mar 26, 2024 at 3:02 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote: > > On Mon, Mar 25, 2024 at 10:02:27AM +0800, Kai-Heng Feng wrote: > > > On Sat, Mar 23, 2024 at 12:43 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote: > ... > > > > If that's the case, why do the > > > > masking in the suspend/resume callbacks? > > > > > > Because there's no functional impact when the error happens, other > > > than suspend/resume. > > > > Oh, I think I see. Is this accurate? > > > > Due to a hardware defect in GL975x, config accesses when ASPM is > > enabled frequently cause Replay Timer Timeouts in the Port leading > > to the device. > > > > These are Correctable Errors, so the Downstream Port logs it in its > > PCI_ERR_COR_STATUS and, when the error is not masked, sends an > > ERR_COR message upstream. The message terminates at a Root Port, > > which may generate an AER interrupt so the OS can log it. > > > > The Correctable Error logging is an annoyance but normally not a > > major issue. But when the AER interrupt happens during suspend, it > > can prevent the system from suspending. > > That's totally the case here. > > This brings up another different but related topic - should the port > driver disable AER/DPC IRQ during suspend? > We've discussed this many times, I still think that's the right > approach to "quiesce" many unexpected errors during system state > transition. Maybe so. We can continue that in the context of that patch. Maybe it needs to be reposted; I can't remember where it's at right now. > > I think we should log a hint in dmesg that we're masking > > PCI_ERR_COR_REP_TIMER because the error will still be logged in the > > PCI_ERR_COR_STATUS register, and that will be visible via lspci, and a > > dmesg hint will save debugging time when people report that. > > Sure. Where do you think it's a better place to implement the quirk? I > Assume PCI quirk is a better place than driver's probe routine? Yes, I think drivers/pci/quirks.c is a better place so we can mask it even if the driver isn't loaded. Users can still run lspci and see these errors even if the driver isn't loaded. Bjorn