Re: [PATCH v2] PCI: Avoid putting some root ports into D3 on some Ryzen chips

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




Am 11.12.24 um 22:24 schrieb Mario Limonciello:
On 12/11/2024 06:47, Werner Sembach wrote:
Hi,

Am 10.12.24 um 17:00 schrieb Mario Limonciello:
On 12/10/2024 09:24, Werner Sembach wrote:
Hi,

Am 09.12.24 um 20:45 schrieb Mario Limonciello:
On 12/9/2024 13:36, Werner Sembach wrote:
From: Mario Limonciello <mario.limonciello@xxxxxxx>

commit 9d26d3a8f1b0 ("PCI: Put PCIe ports into D3 during suspend")
sets the policy that all PCIe ports are allowed to use D3. When
the system is suspended if the port is not power manageable by the
platform and won't be used for wakeup via a PME this sets up the
policy for these ports to go into D3hot.

This policy generally makes sense from an OSPM perspective but it leads
to problems with wakeup from suspend on the TUXEDO Sirius 16 Gen 1 with
an unupdated BIOS.

- On family 19h model 44h (PCI 0x14b9) this manifests as a missing wakeup
   interrupt.
- On family 19h model 74h (PCI 0x14eb) this manifests as a system hang.

On the affected Device + BIOS combination, add a quirk for the PCI device
ID used by the problematic root port on both chips to ensure that these
root ports are not put into D3hot at suspend.

This patch is based on
https://lore.kernel.org/linux-pci/20230708214457.1229-2- mario.limonciello@xxxxxxx/
but with the added condition both in the documentation and in the code to
apply only to the TUXEDO Sirius 16 Gen 1 with the original unpatched BIOS.

Co-developed-by: Georg Gottleuber <ggo@xxxxxxxxxxxxxxxxxxx>
Signed-off-by: Georg Gottleuber <ggo@xxxxxxxxxxxxxxxxxxx>
Co-developed-by: Werner Sembach <wse@xxxxxxxxxxxxxxxxxxx>
Signed-off-by: Werner Sembach <wse@xxxxxxxxxxxxxxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx # 6.1+
Reported-by: Iain Lane <iain@xxxxxxxxxxxxxxxxxxx>
Closes: https://forums.lenovo.com/t5/Ubuntu/Z13-can-t-resume-from- suspend-with-external-USB-keyboard/m-p/5217121
Fixes: 9d26d3a8f1b0 ("PCI: Put PCIe ports into D3 during suspend")
Signed-off-by: Mario Limonciello <mario.limonciello@xxxxxxx>
---
  drivers/pci/quirks.c | 31 +++++++++++++++++++++++++++++++
  1 file changed, 31 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 76f4df75b08a1..2226dca56197d 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3908,6 +3908,37 @@ static void quirk_apple_poweroff_thunderbolt(struct pci_dev *dev)
  DECLARE_PCI_FIXUP_SUSPEND_LATE(PCI_VENDOR_ID_INTEL,
PCI_DEVICE_ID_INTEL_CACTUS_RIDGE_4C,
                     quirk_apple_poweroff_thunderbolt);
+
+/*
+ * Putting PCIe root ports on Ryzen SoCs with USB4 controllers into D3hot
+ * may cause problems when the system attempts wake up from s2idle.
+ *
+ * On family 19h model 44h (PCI 0x14b9) this manifests as a missing wakeup
+ * interrupt.
+ * On family 19h model 74h (PCI 0x14eb) this manifests as a system hang.
+ *
+ * This fix is still required on the TUXEDO Sirius 16 Gen1 with the original
+ * unupdated BIOS.
+ */
+static const struct dmi_system_id quirk_ryzen_rp_d3_dmi_table[] = {
+    {
+        .matches = {
+            DMI_MATCH(DMI_SYS_VENDOR, "TUXEDO"),
+            DMI_MATCH(DMI_BOARD_NAME, "APX958"),
+            DMI_MATCH(DMI_BIOS_VERSION, "V1.00A00"),
+        },
+    },
+    {}
+};
+
+static void quirk_ryzen_rp_d3(struct pci_dev *pdev)
+{
+    if (dmi_check_system(quirk_ryzen_rp_d3_dmi_table) &&
+        !acpi_pci_power_manageable(pdev))
+        pdev->dev_flags |= PCI_DEV_FLAGS_NO_D3;
+}
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x14b9, quirk_ryzen_rp_d3);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x14eb, quirk_ryzen_rp_d3);
  #endif
    /*

Wait, what is wrong with:

commit 7d08f21f8c630 ("x86/PCI: Avoid PME from D3hot/D3cold for AMD Rembrandt and Phoenix USB4")

Is that not activating on your system for some reason?

Doesn't seem so, tested with the old BIOS and 6.13-rc2 and had blackscreen on wakeup.

OK, I think we need to dig a layer deeper to see which root port is causing problems to understand this.


With a newer BIOS for that device suspend and resume however works.


Is there any reason that people would realistically be staying on the old BIOS and instead we need to carry this quirk in the kernel for eternity?
Fear of device bricking or not knowing an update is available.

The not knowing is solved by publishing firmware to LVFS.

Most "popular" distributions include fwupd, regularly check for updates and will notify users through the MOTD or a graphical application that there is something available.


With the Linux ecosystem for BIOS updates through fwupd + LVFS it's not a very big barrier to entry to do an update like it was 20 years ago.
Sadly fwupd/LVFS does not support executing arbitrary efi binaries/nsh scripts which still is the main form ODMs provide bios updates.

It's tangential to this thread; but generally ODMs will provide you a capsule if you ask for one.

I already evaluated this a while back with the results:

If the BIOS is an AMI one you can get a capsule, but this capsule might overwrite DMI strings or the vendor boot logo if not flashed with the AMI flasher and extra flags (sadly I was not able to find out what these flags do under the hood to give fwupd the same capabilities). Also these capsules often not include Intel ME and EC firmware updates and certain bios version might only work with certain EC firmwares.

Last I checked in with the ODM that uses Insyde BIOSes we where not able to get a capsule update. I don't know if that is an ODM or Insyde problem.

For the rest I will come back to this on Monday as I'm currently away from the testing device, thanks for all the feedback so far.


Anyhow if you are providing scripts and random EFI binaries in order to get things updated, that explains why this is a large enough chunk of people to justify a patch like this.


Looking in the patch the device id's are different (0x162e, 0x162f, 0x1668, and 0x1669).


TUXEDO Sirius 16 Gen1 is Phoenix based, right?  So at a minimum you shouldn't be including PCI IDs from Rembrandt (0x14b9)
Thanks for the hint, I can delete that then.

Here is the topology from a Phoenix system on my side:

https://gist.github.com/superm1/85bf0c053008435458bdb39418e109d8

Sorry for the noob question: How do I get that format? it's not lspci - tvnn on my system

No worry.

Having looked at a lot of s2idle bugs I've found it's generally helpful to know what ACPI companion is assigned to devices and what are parents.

So it's actually created by this function in the s2idle issue triage script, amd_s2idle.py.

https://gitlab.freedesktop.org/drm/amd/-/blob/master/scripts/amd_s2idle.py#L1945

And while on the topic, does your "broken" BIOS report this from that script?

'''
Platform may hang resuming.  Upgrade your firmware or add pcie_port_pm=off to kernel command line if you have problems
'''


But i think this contains the same information:

$ lspci -Pnn
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Phoenix Root Complex [1022:14e8] 00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Phoenix IOMMU [1022:14e9] 00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Phoenix Dummy Host Bridge [1022:14ea] 00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Phoenix GPP Bridge [1022:14ed] 00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Phoenix Dummy Host Bridge [1022:14ea] 00:02.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Phoenix GPP Bridge [1022:14ee] 00:02.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Phoenix GPP Bridge [1022:14ee] 00:02.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Phoenix GPP Bridge [1022:14ee] 00:02.4 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Phoenix GPP Bridge [1022:14ee] 00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Phoenix Dummy Host Bridge [1022:14ea] 00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 19h USB4/Thunderbolt PCIe tunnel [1022:14ef] 00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Phoenix Dummy Host Bridge [1022:14ea] 00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Phoenix Dummy Host Bridge [1022:14ea] 00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Phoenix Internal GPP Bridge to Bus [C:A] [1022:14eb] 00:08.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Phoenix Internal GPP Bridge to Bus [C:A] [1022:14eb] 00:08.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Phoenix Internal GPP Bridge to Bus [C:A] [1022:14eb] 00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 71) 00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51) 00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Phoenix Data Fabric; Function 0 [1022:14f0] 00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Phoenix Data Fabric; Function 1 [1022:14f1] 00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Phoenix Data Fabric; Function 2 [1022:14f2] 00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Phoenix Data Fabric; Function 3 [1022:14f3] 00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Phoenix Data Fabric; Function 4 [1022:14f4] 00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Phoenix Data Fabric; Function 5 [1022:14f5] 00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Phoenix Data Fabric; Function 6 [1022:14f6] 00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Phoenix Data Fabric; Function 7 [1022:14f7] 00:01.1/00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch [1002:1478] (rev 12) 00:01.1/00.0/00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ ATI] Navi 10 XL Downstream Port of PCI Express Switch [1002:1479] (rev 12) 00:01.1/00.0/00.0/00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 33 [Radeon RX 7600/7600 XT/7600M XT/7600S/7700S / PRO W7600] [1002:7480] (rev c7) 00:01.1/00.0/00.0/00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 HDMI/DP Audio [1002:ab30] 00:02.1/00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 15) 00:02.2/00.0 Network controller [0280]: Intel Corporation Wi-Fi 6E(802.11ax) AX210/AX1675* 2x2 [Typhoon Peak] [8086:2725] (rev 1a) 00:02.4/00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808] 00:08.1/00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Phoenix1 [1002:15bf] (rev c1) 00:08.1/00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Rembrandt Radeon High Definition Audio Controller [1002:1640] 00:08.1/00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Phoenix CCP/PSP 3.0 Device [1022:15c7] 00:08.1/00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Device [1022:15b9] 00:08.1/00.4 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Device [1022:15ba] 00:08.1/00.5 Multimedia controller [0480]: Advanced Micro Devices, Inc. [AMD] ACP/ACP3X/ACP6x Audio Coprocessor [1022:15e2] (rev 63) 00:08.1/00.6 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h/19h/1ah HD Audio Controller [1022:15e3] 00:08.2/00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Phoenix Dummy Function [1022:14ec] 00:08.2/00.1 Signal processing controller [1180]: Advanced Micro Devices, Inc. [AMD] AMD IPU Device [1022:1502] 00:08.3/00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Phoenix Dummy Function [1022:14ec] 00:08.3/00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Device [1022:15c0] 00:08.3/00.4 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Device [1022:15c1] 00:08.3/00.5 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Pink Sardine USB4/Thunderbolt NHI controller #1 [1022:1668]


That's why 7d08f21f8c630 intentionally matches the NHI and then changes the root port right above that instead of all the root ports - because that is where the problem was.
For some reason it doesn't seem to trigger on my system (added debug output) I will look further into it why that happens.

You never see this message in your logs at suspend time (even on a "fixed" BIOS)?

"quirk: disabling D3cold for suspend"

I'm /suspecting/ you do see it, but you're having problems with another root port.

I mentioned this in my previous iterations of patches that eventually landed on that quirk, but Windows and Linux handle root ports differently at suspend time and that could be why it's exposing your BIOS bug.

If you can please narrow down which root ports actually need the quirk for your side (feel free to do a similar style to 7d08f21f8c630) I think we could land on something more narrow and upstreamable.

At a minimum what you're doing today is covering both Rembrandt and Phoenix and it should only apply to Phoenix.





[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux