On Thu, Oct 8, 2020 at 6:19 AM Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx> wrote: > > > > > On Oct 7, 2020, at 21:30, Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote: > > > > On Wed, Oct 07, 2020 at 12:26:19PM +0800, Kai-Heng Feng wrote: > >>> On Oct 6, 2020, at 03:19, Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote: > >>> On Tue, Oct 06, 2020 at 02:40:32AM +0800, Kai-Heng Feng wrote: > >>>>> On Oct 3, 2020, at 06:18, Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote: > >>>>> On Wed, Sep 30, 2020 at 04:24:54PM +0800, Kai-Heng Feng wrote: > > > > ... > >>>> I wonder whether other devices that add PCIe domain have the same > >>>> behavior? Maybe it's not a special case at all... > >>> > >>> What other devices are these? > >> > >> Controllers which add PCIe domain. > > > > I was looking for specific examples, not just a restatement of what > > you said before. I'm just curious because there are a lot of > > controllers I'm not familiar with, and I can't think of an example. > > > >>>> I understand the end goal is to keep consistency for the entire ASPM > >>>> logic. However I can't think of any possible solution right now. > >>>> > >>>>> - If we built with CONFIG_PCIEASPM_POWERSAVE=y, would that solve the > >>>>> SoC power state problem? > >>>> > >>>> Yes. > >>>> > >>>>> - What issues would CONFIG_PCIEASPM_POWERSAVE=y introduce? > >>>> > >>>> This will break many systems, at least for the 1st Gen Ryzen > >>>> desktops and laptops. > >>>> > >>>> All PCIe ASPM are not enabled by BIOS, and those systems immediately > >>>> freeze once ASPM is enabled. > >>> > >>> That indicates a defect in the Linux ASPM code. We should fix that. > >>> It should be safe to use CONFIG_PCIEASPM_POWERSAVE=y on every system. > >> > >> On those systems ASPM are also not enabled on Windows. So I think > >> ASPM are disabled for a reason. > > > > If the platform knows ASPM needs to be disabled, it should be using > > ACPI_FADT_NO_ASPM or _OSC to prevent the OS from using it. And if > > CONFIG_PCIEASPM_POWERSAVE=y means Linux enables ASPM when it > > shouldn't, that's a Linux bug that we need to fix. > > Yes that's a bug which fixed by Ian's new patch. > > > > >>> Are there bug reports for these? The info we would need to start with > >>> includes "lspci -vv" and dmesg log (with CONFIG_PCIEASPM_DEFAULT=y). > >>> If a console log with CONFIG_PCIEASPM_POWERSAVE=y is available, that > >>> might be interesting, too. We'll likely need to add some > >>> instrumentation and do some experimentation, but in principle, this > >>> should be fixable. > >> > >> Doing this is asking users to use hardware settings that ODM/OEM > >> never tested, and I think the risk is really high. > > > > What? That's not what I said at all. I'm asking for information > > about these hangs so we can fix them. I'm not suggesting that you > > should switch to CONFIG_PCIEASPM_POWERSAVE=y for the distro. > > Ah, I thought your suggestion is switching to CONFIG_PCIEASPM_POWERSAVE=y, because I sense you want to use that to cover the VMD ASPM this patch tries to solve. > > Do we have a conclusion how to enable VMD ASPM with CONFIG_PCIEASPM_DEFAULT=y? > > > > > Let's back up. You said: > > > > CONFIG_PCIEASPM_POWERSAVE=y ... will break many systems, at least > > for the 1st Gen Ryzen desktops and laptops. > > > > All PCIe ASPM are not enabled by BIOS, and those systems immediately > > freeze once ASPM is enabled. > > > > These system hangs might be caused by (1) some hardware issue that > > causes a hang when ASPM is enabled even if it is configured correctly > > or (2) Linux configuring ASPM incorrectly. > > It's (2) here. > > > > > For case (1), the platform should be using ACPI_FADT_NO_ASPM or _OSC > > to prevent the OS from enabling ASPM. Linux should pay attention to > > that even when CONFIG_PCIEASPM_POWERSAVE=y. > > > > If the platform *should* use these mechanisms but doesn't, the > > solution is a quirk, not the folklore that "we can't use > > CONFIG_PCIEASPM_POWERSAVE=y because it breaks some systems." > > The platform in question doesn't prevent OS from enabling ASPM. > > > > > For case (2), we should fix Linux so it configures ASPM correctly. > > > > We cannot use the build-time CONFIG_PCIEASPM settings to avoid these > > hangs. We need to fix the Linux run-time code so the system operates > > correctly no matter what CONFIG_PCIEASPM setting is used. > > > > We have sysfs knobs to control ASPM (see 72ea91afbfb0 ("PCI/ASPM: Add > > sysfs attributes for controlling ASPM link states")). They can do the > > same thing at run-time as CONFIG_PCIEASPM_POWERSAVE=y does at > > build-time. If those knobs cause hangs on 1st Gen Ryzen systems, we > > need to fix that. > > Ian's patch solves the issue, at least for the systems I have. Could you add: diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c index 15d64832a988..cd9f2101f9a2 100644 --- a/drivers/pci/pcie/aspm.c +++ b/drivers/pci/pcie/aspm.c @@ -482,7 +482,12 @@ static void pcie_aspm_check_latency(struct pci_dev *endpoint) latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1); l1_max_latency = max_t(u32, latency, l1_max_latency); if (l1_max_latency + l1_switch_latency > acceptable->l1) + { + pci_info(endpoint, "L1 latency exceeded - path: %i - max: %i\n", l1_switch_latency, l1_max_latency); + pci_info(link->pdev, "Upstream device - %i\n", link->latency_up.l1); + pci_info(link->downstream, "Downstream device - %i\n", link->latency_dw.l1); link->aspm_capable &= ~ASPM_STATE_L1; + } l1_switch_latency += 1000; } So we can see what device triggers what links to be disabled? I think your use-case is much more important than mine - mine fixes something as a side effect Also, please send me the lspci -vvv output as well as lspci -PP -s <device id> for the device id:s mentioned in dmesg with the patch above applied ;) > Kai-Heng > > > > > Bjorn >