On 10.07.2024 17:09, George-Daniel Matei wrote:
> Hi,
>
>>> Added aspm suspend/resume hooks that run
>>> before and after suspend and resume to change
>>> the ASPM states of the PCI bus in order to allow
>>> the system suspend while trying to prevent card hangs
>>
>> Why is this needed?  Is there a r8169 defect we're working around?
>> A BIOS defect?  Is there a problem report you can reference here?
>>
>
> We encountered this issue while upgrading from kernel v6.1 to v6.6.
> The system would not suspend with 6.6. We tracked down the problem to
> the NIC of the device, mainly that the following code was removed in
> 6.6:
>> else if (tp->mac_version >= RTL_GIGA_MAC_VER_46)
>>         rc = pci_disable_link_state(pdev, PCIE_LINK_STATE_L1_2);
> For the listed devices, ASPM L1 is disabled entirely in 6.6. As for
> the reason, L1 was observed to cause some problems
> (https://bugzilla.kernel.org/show_bug.cgi?id=217814). We use a Raptor
> Lake soc and it won't change residency if the NIC doesn't have L1
> enabled. I saw in 6.1 the following comment:

By residency, do you mean the package power-saving state?

>> Chips from RTL8168h partially have issues with L1.2, but seem
>> to work fine with L1 and L1.1.
> I was thinking that disabling/enabling L1.1 on the fly before/after
> suspend could help mitigate the risk associated with L1/L1.1 . I know
> that ASPM settings are exposed in sysfs and that this could be done
> from outside the kernel, that was my first approach, but it was
> suggested to me that this kind of workaround would be better suited
> for quirks. I did around 1000 suspend/resume cycles of 16-30 seconds
> each (correcting the resume dev->bus->self being configured twice
> mistake) and did not notice any problems. What do you think, is this a
> good approach ... ?
>

If the root cause really is in the SoC's ASPM implementation, then:
- Other systems with the same SoC may suffer from the same problem, but
  are not covered by the quirk.
- The issue may also occur with devices other than an RTL8168 NIC.
  What about e.g. the RTL8125? Or completely different PCI devices?

What I understand so far from your description:
W/o ASPM L1 the SoC doesn't change "residency". See comment above, please
elaborate on this.
And w/ ASPM L1 the NIC hangs on suspend?
What are the dmesg entries related to this hang? Tx timeout? Or is the card
not accessible at all?

My perspective so far:
It's a relatively complex quirk that covers only a part of the potentially
affected systems, and the issue isn't well understood.
And most likely there are lots of systems out there with a Raptor Lake CPU
and an RTL8168 on board. Therefore it's surprising that there hasn't been a
similar report before.

>>> +             //configure device
>>> +             pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL,
>>> +                                                PCI_EXP_LNKCTL_ASPMC, 0);
>>> +
>>> +             pci_read_config_word(dev->bus->self,
>>> +                                  dev->bus->self->l1ss + PCI_L1SS_CTL1,
>>> +                                  &val);
>>> +             val = val & ~PCI_L1SS_CTL1_L1SS_MASK;
>>> +             pci_write_config_word(dev->bus->self,
>>> +                                   dev->bus->self->l1ss + PCI_L1SS_CTL1,
>>> +                                   val);
>> Updates the parent (dev->bus->self) twice; was the first one supposed
>> to update the device (dev)?
> Yes, it was supposed to update the device (dev). It's my first time
> sending a patch and I messed something up while doing some style
> changes, I will correct it. I'm sorry for that.
>
>> This doesn't restore the state as it existed before suspend.  Does
>> this rely on other parts of restore to do that?
> It operates on the assumption that after driver initialization
> PCI_EXP_LNKCTL_ASPMC is 0 and that there are no states enabled in
> CTL1. I did a lspci -vvv dump on the affected devices before and after
> the quirks ran and saw no difference. This could be improved.
>
>> What is the RTL8168 chip version used on these systems?
> It should be RTL8111H.
>
>> What's the root cause of the issue?
>> A silicon bug on the host side?
> I think it's the ASPM implementation of the soc.
>
>> ASPM L1 is disabled per default in r8169. So why is the patch needed
>> at all?
> Leaving it disabled all the time prevents the system from suspending.
>

This is not clear to me. You refer to STR? Why should a system not suspend
just because one PCI device doesn't have ASPM L1 enabled?

> Thank you,
> George-Daniel Matei
>
> On Tue, Jul 9, 2024 at 12:15 AM Heiner Kallweit <hkallweit1@xxxxxxxxx> wrote:
>>
>> On 08.07.2024 19:23, Bjorn Helgaas wrote:
>>> [+cc r8169 folks]
>>>
>>> On Mon, Jul 08, 2024 at 03:38:15PM +0000, George-Daniel Matei wrote:
>>>> Added aspm suspend/resume hooks that run
>>>> before and after suspend and resume to change
>>>> the ASPM states of the PCI bus in order to allow
>>>> the system suspend while trying to prevent card hangs
>>>
>>> Why is this needed?  Is there a r8169 defect we're working around?
>>> A BIOS defect?  Is there a problem report you can reference here?
>>>
>>
>> Basically the same question from my side. Apparently such a workaround
>> isn't needed on any other system. And Realtek NICs can be found on more
>> or less every consumer system. What's the root cause of the issue?
>> A silicon bug on the host side?
>>
>> What is the RTL8168 chip version used on these systems?
>>
>> ASPM L1 is disabled per default in r8169. So why is the patch needed
>> at all?
>>
>>> s/Added/Add/
>>>
>>> s/aspm/ASPM/ above
>>>
>>> s/PCI bus/device and parent/
>>>
>>> Add period at end of sentence.
>>>
>>> Rewrap to fill 75 columns.
>>>
>>>> Signed-off-by: George-Daniel Matei <danielgeorgem@xxxxxxxxxxxx>
>>>> ---
>>>>  drivers/pci/quirks.c | 142 +++++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 142 insertions(+)
>>>>
>>>> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
>>>> index dc12d4a06e21..aa3dba2211d3 100644
>>>> --- a/drivers/pci/quirks.c
>>>> +++ b/drivers/pci/quirks.c
>>>> @@ -6189,6 +6189,148 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x56b0, aspm_l1_acceptable_latency
>>>>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x56b1, aspm_l1_acceptable_latency);
>>>>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x56c0, aspm_l1_acceptable_latency);
>>>>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x56c1, aspm_l1_acceptable_latency);
>>>> +
>>>> +static const struct dmi_system_id chromebox_match_table[] = {
>>>> +        {
>>>> +                .matches = {
>>>> +                        DMI_MATCH(DMI_PRODUCT_NAME, "Brask"),
>>>> +                        DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"),
>>>> +                }
>>>> +        },
>>>> +        {
>>>> +                .matches = {
>>>> +                        DMI_MATCH(DMI_PRODUCT_NAME, "Aurash"),
>>>> +                        DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"),
>>>> +                }
>>>> +        },
>>>> +        {
>>>> +                .matches = {
>>>> +                        DMI_MATCH(DMI_PRODUCT_NAME, "Bujia"),
>>>> +                        DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"),
>>>> +                }
>>>> +        },
>>>> +        {
>>>> +                .matches = {
>>>> +                        DMI_MATCH(DMI_PRODUCT_NAME, "Gaelin"),
>>>> +                        DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"),
>>>> +                }
>>>> +        },
>>>> +        {
>>>> +                .matches = {
>>>> +                        DMI_MATCH(DMI_PRODUCT_NAME, "Gladios"),
>>>> +                        DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"),
>>>> +                }
>>>> +        },
>>>> +        {
>>>> +                .matches = {
>>>> +                        DMI_MATCH(DMI_PRODUCT_NAME, "Hahn"),
>>>> +                        DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"),
>>>> +                }
>>>> +        },
>>>> +        {
>>>> +                .matches = {
>>>> +                        DMI_MATCH(DMI_PRODUCT_NAME, "Jeev"),
>>>> +                        DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"),
>>>> +                }
>>>> +        },
>>>> +        {
>>>> +                .matches = {
>>>> +                        DMI_MATCH(DMI_PRODUCT_NAME, "Kinox"),
>>>> +                        DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"),
>>>> +                }
>>>> +        },
>>>> +        {
>>>> +                .matches = {
>>>> +                        DMI_MATCH(DMI_PRODUCT_NAME, "Kuldax"),
>>>> +                        DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"),
>>>> +                }
>>>> +        },
>>>> +        {
>>>> +                .matches = {
>>>> +                        DMI_MATCH(DMI_PRODUCT_NAME, "Lisbon"),
>>>> +                        DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"),
>>>> +                }
>>>> +        },
>>>> +        {
>>>> +                .matches = {
>>>> +                        DMI_MATCH(DMI_PRODUCT_NAME, "Moli"),
>>>> +                        DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"),
>>>> +                }
>>>> +        },
>>>> +        { }
>>>> +};
>>>> +
>>>> +static void rtl8169_suspend_aspm_settings(struct pci_dev *dev)
>>>> +{
>>>> +        u16 val = 0;
>>>> +
>>>> +        if (dmi_check_system(chromebox_match_table)) {
>>>> +                //configure parent
>>>> +                pcie_capability_clear_and_set_word(dev->bus->self,
>>>> +                                                   PCI_EXP_LNKCTL,
>>>> +                                                   PCI_EXP_LNKCTL_ASPMC,
>>>> +                                                   PCI_EXP_LNKCTL_ASPM_L1);
>>>> +
>>>> +                pci_read_config_word(dev->bus->self,
>>>> +                                     dev->bus->self->l1ss + PCI_L1SS_CTL1,
>>>> +                                     &val);
>>>> +                val = (val & ~PCI_L1SS_CTL1_L1SS_MASK) |
>>>> +                      PCI_L1SS_CTL1_PCIPM_L1_2 | PCI_L1SS_CTL1_PCIPM_L1_2 |
>>>> +                      PCI_L1SS_CTL1_ASPM_L1_1;
>>>> +                pci_write_config_word(dev->bus->self,
>>>> +                                      dev->bus->self->l1ss + PCI_L1SS_CTL1,
>>>> +                                      val);
>>>> +
>>>> +                //configure device
>>>> +                pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL,
>>>> +                                                   PCI_EXP_LNKCTL_ASPMC,
>>>> +                                                   PCI_EXP_LNKCTL_ASPM_L1);
>>>> +
>>>> +                pci_read_config_word(dev, dev->l1ss + PCI_L1SS_CTL1, &val);
>>>> +                val = (val & ~PCI_L1SS_CTL1_L1SS_MASK) |
>>>> +                      PCI_L1SS_CTL1_PCIPM_L1_2 | PCI_L1SS_CTL1_PCIPM_L1_2 |
>>>> +                      PCI_L1SS_CTL1_ASPM_L1_1;
>>>> +                pci_write_config_word(dev, dev->l1ss + PCI_L1SS_CTL1, val);
>>>> +        }
>>>> +}
>>>> +
>>>> +DECLARE_PCI_FIXUP_SUSPEND(PCI_VENDOR_ID_REALTEK, 0x8168,
>>>> +                          rtl8169_suspend_aspm_settings);
>>>> +
>>>> +static void rtl8169_resume_aspm_settings(struct pci_dev *dev)
>>>> +{
>>>> +        u16 val = 0;
>>>> +
>>>> +        if (dmi_check_system(chromebox_match_table)) {
>>>> +                //configure device
>>>> +                pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL,
>>>> +                                                   PCI_EXP_LNKCTL_ASPMC, 0);
>>>> +
>>>> +                pci_read_config_word(dev->bus->self,
>>>> +                                     dev->bus->self->l1ss + PCI_L1SS_CTL1,
>>>> +                                     &val);
>>>> +                val = val & ~PCI_L1SS_CTL1_L1SS_MASK;
>>>> +                pci_write_config_word(dev->bus->self,
>>>> +                                      dev->bus->self->l1ss + PCI_L1SS_CTL1,
>>>> +                                      val);
>>>> +
>>>> +                //configure parent
>>>> +                pcie_capability_clear_and_set_word(dev->bus->self,
>>>> +                                                   PCI_EXP_LNKCTL,
>>>> +                                                   PCI_EXP_LNKCTL_ASPMC, 0);
>>>> +
>>>> +                pci_read_config_word(dev->bus->self,
>>>> +                                     dev->bus->self->l1ss + PCI_L1SS_CTL1,
>>>> +                                     &val);
>>>> +                val = val & ~PCI_L1SS_CTL1_L1SS_MASK;
>>>> +                pci_write_config_word(dev->bus->self,
>>>> +                                      dev->bus->self->l1ss + PCI_L1SS_CTL1,
>>>> +                                      val);
>>>
>>> Updates the parent (dev->bus->self) twice; was the first one supposed
>>> to update the device (dev)?
>>>
>>> This doesn't restore the state as it existed before suspend.  Does
>>> this rely on other parts of restore to do that?
>>>
>>>> +        }
>>>> +}
>>>> +
>>>> +DECLARE_PCI_FIXUP_RESUME(PCI_VENDOR_ID_REALTEK, 0x8168,
>>>> +                         rtl8169_resume_aspm_settings);
>>>>  #endif
>>>>
>>>>  #ifdef CONFIG_PCIE_DPC
>>>> --
>>>> 2.45.2.803.g4e1b14247a-goog
>>>>
>>
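
For illustration only: a minimal user-space sketch of the save/restore idea
raised in the review comment quoted above ("This doesn't restore the state as
it existed before suspend"). Instead of assuming the ASPM bits were zero, it
records them before changing them for suspend and writes the same bits back on
resume. The bit values are the ones defined in include/uapi/linux/pci_regs.h.
Everything else is an assumption made for the example: struct fake_dev,
aspm_suspend_save() and aspm_resume_restore() are hypothetical names, the two
16-bit fields merely stand in for the Link Control and L1SS Control 1
registers, and the choice of L1 plus ASPM L1.1 for suspend is arbitrary. It is
not kernel code and not the fix proposed in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values as defined in include/uapi/linux/pci_regs.h. */
#define PCI_EXP_LNKCTL_ASPMC     0x0003 /* ASPM Control field */
#define PCI_EXP_LNKCTL_ASPM_L1   0x0002 /* L1 entry enabled */
#define PCI_L1SS_CTL1_ASPM_L1_1  0x0008 /* ASPM L1.1 enable */
#define PCI_L1SS_CTL1_L1SS_MASK  0x000f /* all four L1 substate enables */

/* Hypothetical stand-in for the registers of one PCI function. */
struct fake_dev {
	uint16_t lnkctl;      /* models PCI_EXP_LNKCTL */
	uint16_t l1ss_ctl1;   /* models the low word of PCI_L1SS_CTL1 */
	uint16_t saved_aspmc; /* ASPM Control bits seen before suspend */
	uint16_t saved_l1ss;  /* L1SS enable bits seen before suspend */
};

/* Remember the current ASPM bits, then enable L1 and ASPM L1.1. */
static void aspm_suspend_save(struct fake_dev *d)
{
	d->saved_aspmc = d->lnkctl & PCI_EXP_LNKCTL_ASPMC;
	d->saved_l1ss = d->l1ss_ctl1 & PCI_L1SS_CTL1_L1SS_MASK;

	d->lnkctl = (uint16_t)((d->lnkctl & ~PCI_EXP_LNKCTL_ASPMC) |
			       PCI_EXP_LNKCTL_ASPM_L1);
	d->l1ss_ctl1 = (uint16_t)((d->l1ss_ctl1 & ~PCI_L1SS_CTL1_L1SS_MASK) |
				  PCI_L1SS_CTL1_ASPM_L1_1);
}

/* Put back exactly the bits recorded by aspm_suspend_save(). */
static void aspm_resume_restore(struct fake_dev *d)
{
	d->lnkctl = (uint16_t)((d->lnkctl & ~PCI_EXP_LNKCTL_ASPMC) |
			       d->saved_aspmc);
	d->l1ss_ctl1 = (uint16_t)((d->l1ss_ctl1 & ~PCI_L1SS_CTL1_L1SS_MASK) |
				  d->saved_l1ss);
}

/* Round trip: bits outside the ASPM fields stay untouched throughout. */
static void aspm_roundtrip_selfcheck(void)
{
	struct fake_dev d = { .lnkctl = 0x0041, .l1ss_ctl1 = 0x0005 };

	aspm_suspend_save(&d);
	assert((d.lnkctl & PCI_EXP_LNKCTL_ASPMC) == PCI_EXP_LNKCTL_ASPM_L1);
	aspm_resume_restore(&d);
	assert(d.lnkctl == 0x0041 && d.l1ss_ctl1 == 0x0005);
}
```

In a real quirk the same pattern would have to be applied to both the device
and its parent, and any ordering requirements between the two link partners
are not modelled here.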