On Thu, Jul 11, 2024 at 7:45 AM Heiner Kallweit <hkallweit1@xxxxxxxxx> wrote: > > On 10.07.2024 17:09, George-Daniel Matei wrote: > > Hi, > > > >>> Added aspm suspend/resume hooks that run > >>> before and after suspend and resume to change > >>> the ASPM states of the PCI bus in order to allow > >>> the system suspend while trying to prevent card hangs > >> > >> Why is this needed? Is there a r8169 defect we're working around? > >> A BIOS defect? Is there a problem report you can reference here? > >> > > > > We encountered this issue while upgrading from kernel v6.1 to v6.6. > > The system would not suspend with 6.6. We tracked down the problem to > > the NIC of the device, mainly that the following code was removed in > > 6.6: > >> else if (tp->mac_version >= RTL_GIGA_MAC_VER_46) > >> rc = pci_disable_link_state(pdev, PCIE_LINK_STATE_L1_2);1 > > With this (older) 6.1 version everything is ok? > Would mean that L1.1 is active and the system suspends (STR?) properly > also with L1.1 being active. > Yes, with 6.1 everything was ok. L1 was active and just the L1.1 substate was enabled, L1.2 was disabled. > Under 6.6 per default L1 (incl. sub-states) is disabled. > Then you manually enable L1 (incl. L1.1, but not L1.2?) via sysfs, > and now the system hangs on suspend? > Yes, in 6.6 L1 (+substates) is disabled. Like Bjorn mentioned, I think that is because of 90ca51e8c654 ("r8169: fix ASPM-related issues on a number of systems with NIC version from RTL8168h". With L1 disabled the system would not suspend so I enabled back L1 along with just L1.1 substate through sysfs, just to test, and saw that the system could suspend again. L1 is disabled by default for a reason, that's because it could cause tx timeouts. So to try to work around the possible timeouts I thought of changing the ASPM states before suspending and then restoring on resume. > Is this what you're saying? Would be strange because in both cases > L1.1 is active when suspending. > > > > For the listed devices, ASPM L1 is disabled entirely in 6.6. As for > > the reason, L1 was observed to cause some problems > > (https://bugzilla.kernel.org/show_bug.cgi?id=217814). We use a Raptor > > Lake soc and it won't change residency if the NIC doesn't have L1 > > enabled. I saw in 6.1 the following comment: > >> Chips from RTL8168h partially have issues with L1.2, but seem > >> to work fine with L1 and L1.1. > > I was thinking that disabling/enabling L1.1 on the fly before/after > > suspend could help mitigate the risk associated with L1/L1.1 . I know > > that ASPM settings are exposed in sysfs and that this could be done > > from outside the kernel, that was my first approach, but it was > > suggested to me that this kind of workaround would be better suited > > for quirks. I did around 1000 suspend/resume cycles of 16-30 seconds > > each (correcting the resume dev->bus->self being configured twice > > mistake) and did not notice any problems. What do you think, is this a > > good approach ... ? > > > >>> + //configure device > >>> + pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL, > >>> + PCI_EXP_LNKCTL_ASPMC, 0); > >>> + > >>> + pci_read_config_word(dev->bus->self, > >>> + dev->bus->self->l1ss + PCI_L1SS_CTL1, > >>> + &val); > >>> + val = val & ~PCI_L1SS_CTL1_L1SS_MASK; > >>> + pci_write_config_word(dev->bus->self, > >>> + dev->bus->self->l1ss + PCI_L1SS_CTL1, > >>> + val); > >> Updates the parent (dev->bus->self) twice; was the first one supposed > >> to update the device (dev)? > > Yes, it was supposed to update the device (dev). It's my first time > > sending a patch and I messed something up while doing some style > > changes, I will correct it. I'm sorry for that. > > > >> This doesn't restore the state as it existed before suspend. Does > >> this rely on other parts of restore to do that? > > It operates on the assumption that after driver initialization > > PCI_EXP_LNKCTL_ASPMC is 0 and that there are no states enabled in > > CTL1. I did a lspci -vvv dump on the affected devices before and after > > the quirks ran and saw no difference. This could be improved. > > > >> What is the RTL8168 chip version used on these systems? > > It should be RTL8111H. > > > >> What's the root cause of the issue? > >> A silicon bug on the host side? > > I think it's the ASPM implementation of the soc. > > > >> ASPM L1 is disabled per default in r8169. So why is the patch needed > >> at all? > > Leaving it disabled all the time prevents the system from suspending. > > > > Thank you, > > George-Daniel Matei > > > > > > > > > > > > On Tue, Jul 9, 2024 at 12:15 AM Heiner Kallweit <hkallweit1@xxxxxxxxx> wrote: > >> > >> On 08.07.2024 19:23, Bjorn Helgaas wrote: > >>> [+cc r8169 folks] > >>> > >>> On Mon, Jul 08, 2024 at 03:38:15PM +0000, George-Daniel Matei wrote: > >>>> Added aspm suspend/resume hooks that run > >>>> before and after suspend and resume to change > >>>> the ASPM states of the PCI bus in order to allow > >>>> the system suspend while trying to prevent card hangs > >>> > >>> Why is this needed? Is there a r8169 defect we're working around? > >>> A BIOS defect? Is there a problem report you can reference here? > >>> > >> > >> Basically the same question from my side. Apparently such a workaround > >> isn't needed on any other system. And Realtek NICs can be found on more > >> or less every consumer system. What's the root cause of the issue? > >> A silicon bug on the host side? > >> > >> What is the RTL8168 chip version used on these systems? > >> > >> ASPM L1 is disabled per default in r8169. So why is the patch needed > >> at all? > >> > >>> s/Added/Add/ > >>> > >>> s/aspm/ASPM/ above > >>> > >>> s/PCI bus/device and parent/ > >>> > >>> Add period at end of sentence. > >>> > >>> Rewrap to fill 75 columns. > >>> > >>>> Signed-off-by: George-Daniel Matei <danielgeorgem@xxxxxxxxxxxx> > >>>> --- > >>>> drivers/pci/quirks.c | 142 +++++++++++++++++++++++++++++++++++++++++++ > >>>> 1 file changed, 142 insertions(+) > >>>> > >>>> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > >>>> index dc12d4a06e21..aa3dba2211d3 100644 > >>>> --- a/drivers/pci/quirks.c > >>>> +++ b/drivers/pci/quirks.c > >>>> @@ -6189,6 +6189,148 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x56b0, aspm_l1_acceptable_latency > >>>> DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x56b1, aspm_l1_acceptable_latency); > >>>> DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x56c0, aspm_l1_acceptable_latency); > >>>> DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x56c1, aspm_l1_acceptable_latency); > >>>> + > >>>> +static const struct dmi_system_id chromebox_match_table[] = { > >>>> + { > >>>> + .matches = { > >>>> + DMI_MATCH(DMI_PRODUCT_NAME, "Brask"), > >>>> + DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"), > >>>> + } > >>>> + }, > >>>> + { > >>>> + .matches = { > >>>> + DMI_MATCH(DMI_PRODUCT_NAME, "Aurash"), > >>>> + DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"), > >>>> + } > >>>> + }, > >>>> + { > >>>> + .matches = { > >>>> + DMI_MATCH(DMI_PRODUCT_NAME, "Bujia"), > >>>> + DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"), > >>>> + } > >>>> + }, > >>>> + { > >>>> + .matches = { > >>>> + DMI_MATCH(DMI_PRODUCT_NAME, "Gaelin"), > >>>> + DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"), > >>>> + } > >>>> + }, > >>>> + { > >>>> + .matches = { > >>>> + DMI_MATCH(DMI_PRODUCT_NAME, "Gladios"), > >>>> + DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"), > >>>> + } > >>>> + }, > >>>> + { > >>>> + .matches = { > >>>> + DMI_MATCH(DMI_PRODUCT_NAME, "Hahn"), > >>>> + DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"), > >>>> + } > >>>> + }, > >>>> + { > >>>> + .matches = { > >>>> + DMI_MATCH(DMI_PRODUCT_NAME, "Jeev"), > >>>> + DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"), > >>>> + } > >>>> + }, > >>>> + { > >>>> + .matches = { > >>>> + DMI_MATCH(DMI_PRODUCT_NAME, "Kinox"), > >>>> + DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"), > >>>> + } > >>>> + }, > >>>> + { > >>>> + .matches = { > >>>> + DMI_MATCH(DMI_PRODUCT_NAME, "Kuldax"), > >>>> + DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"), > >>>> + } > >>>> + }, > >>>> + { > >>>> + .matches = { > >>>> + DMI_MATCH(DMI_PRODUCT_NAME, "Lisbon"), > >>>> + DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"), > >>>> + } > >>>> + }, > >>>> + { > >>>> + .matches = { > >>>> + DMI_MATCH(DMI_PRODUCT_NAME, "Moli"), > >>>> + DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"), > >>>> + } > >>>> + }, > >>>> + { } > >>>> +}; > >>>> + > >>>> +static void rtl8169_suspend_aspm_settings(struct pci_dev *dev) > >>>> +{ > >>>> + u16 val = 0; > >>>> + > >>>> + if (dmi_check_system(chromebox_match_table)) { > >>>> + //configure parent > >>>> + pcie_capability_clear_and_set_word(dev->bus->self, > >>>> + PCI_EXP_LNKCTL, > >>>> + PCI_EXP_LNKCTL_ASPMC, > >>>> + PCI_EXP_LNKCTL_ASPM_L1); > >>>> + > >>>> + pci_read_config_word(dev->bus->self, > >>>> + dev->bus->self->l1ss + PCI_L1SS_CTL1, > >>>> + &val); > >>>> + val = (val & ~PCI_L1SS_CTL1_L1SS_MASK) | > >>>> + PCI_L1SS_CTL1_PCIPM_L1_2 | PCI_L1SS_CTL1_PCIPM_L1_2 | > >>>> + PCI_L1SS_CTL1_ASPM_L1_1; > >>>> + pci_write_config_word(dev->bus->self, > >>>> + dev->bus->self->l1ss + PCI_L1SS_CTL1, > >>>> + val); > >>>> + > >>>> + //configure device > >>>> + pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL, > >>>> + PCI_EXP_LNKCTL_ASPMC, > >>>> + PCI_EXP_LNKCTL_ASPM_L1); > >>>> + > >>>> + pci_read_config_word(dev, dev->l1ss + PCI_L1SS_CTL1, &val); > >>>> + val = (val & ~PCI_L1SS_CTL1_L1SS_MASK) | > >>>> + PCI_L1SS_CTL1_PCIPM_L1_2 | PCI_L1SS_CTL1_PCIPM_L1_2 | > >>>> + PCI_L1SS_CTL1_ASPM_L1_1; > >>>> + pci_write_config_word(dev, dev->l1ss + PCI_L1SS_CTL1, val); > >>>> + } > >>>> +} > >>>> + > >>>> +DECLARE_PCI_FIXUP_SUSPEND(PCI_VENDOR_ID_REALTEK, 0x8168, > >>>> + rtl8169_suspend_aspm_settings); > >>>> + > >>>> +static void rtl8169_resume_aspm_settings(struct pci_dev *dev) > >>>> +{ > >>>> + u16 val = 0; > >>>> + > >>>> + if (dmi_check_system(chromebox_match_table)) { > >>>> + //configure device > >>>> + pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL, > >>>> + PCI_EXP_LNKCTL_ASPMC, 0); > >>>> + > >>>> + pci_read_config_word(dev->bus->self, > >>>> + dev->bus->self->l1ss + PCI_L1SS_CTL1, > >>>> + &val); > >>>> + val = val & ~PCI_L1SS_CTL1_L1SS_MASK; > >>>> + pci_write_config_word(dev->bus->self, > >>>> + dev->bus->self->l1ss + PCI_L1SS_CTL1, > >>>> + val); > >>>> + > >>>> + //configure parent > >>>> + pcie_capability_clear_and_set_word(dev->bus->self, > >>>> + PCI_EXP_LNKCTL, > >>>> + PCI_EXP_LNKCTL_ASPMC, 0); > >>>> + > >>>> + pci_read_config_word(dev->bus->self, > >>>> + dev->bus->self->l1ss + PCI_L1SS_CTL1, > >>>> + &val); > >>>> + val = val & ~PCI_L1SS_CTL1_L1SS_MASK; > >>>> + pci_write_config_word(dev->bus->self, > >>>> + dev->bus->self->l1ss + PCI_L1SS_CTL1, > >>>> + val); > >>> > >>> Updates the parent (dev->bus->self) twice; was the first one supposed > >>> to update the device (dev)? > >>> > >>> This doesn't restore the state as it existed before suspend. Does > >>> this rely on other parts of restore to do that? > >>> > >>>> + } > >>>> +} > >>>> + > >>>> +DECLARE_PCI_FIXUP_RESUME(PCI_VENDOR_ID_REALTEK, 0x8168, > >>>> + rtl8169_resume_aspm_settings); > >>>> #endif > >>>> > >>>> #ifdef CONFIG_PCIE_DPC > >>>> -- > >>>> 2.45.2.803.g4e1b14247a-goog > >>>> > >> >