On Mon, Jun 19, 2023 at 11:16:35AM -0500, Limonciello, Mario wrote:
> On 6/15/2023 10:01 PM, Kai-Heng Feng wrote:
> > On Fri, Jun 16, 2023 at 1:12 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > On Thu, Jun 15, 2023 at 03:04:20PM +0800, Kai-Heng Feng wrote:
> > > > When a PCIe device is hotplugged to a Thunderbolt port, ASPM is not
> > > > enabled for that device.  However, when the device is plugged preboot,
> > > > ASPM is enabled by default.
> > > >
> > > > The disparity happens because BIOS doesn't have the ability to program
> > > > ASPM on hotplugged devices.
> > > >
> > > > So enable ASPM by default for externally connected PCIe devices so
> > > > ASPM settings are consistent between preboot and hotplug.
> > > >
> > > > On HP Thunderbolt Dock G4, enabling ASPM can also fix a BadDLLP error:
> > > >   pcieport 0000:00:1d.0: AER: Corrected error received: 0000:07:04.0
> > > >   pcieport 0000:07:04.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
> > > >   pcieport 0000:07:04.0:   device [8086:0b26] error status/mask=00000080/00002000
> > > >   pcieport 0000:07:04.0:    [ 7] BadDLLP
> > > >
> > > > The root cause is still unclear, but quite likely because the I225 on
> > > > the dock supports PTM, where ASPM timing is precalculated for the PTM.
> > > >
> > > > Cc: Mario Limonciello <mario.limonciello@xxxxxxx>
> > > > Cc: Mika Westerberg <mika.westerberg@xxxxxxxxxxxxxxx>
> > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=217557
> > > > Signed-off-by: Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx>
> > > > ---
> > > >  drivers/pci/pcie/aspm.c | 4 +++-
> > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > > index 66d7514ca111..613b0754c9bb 100644
> > > > --- a/drivers/pci/pcie/aspm.c
> > > > +++ b/drivers/pci/pcie/aspm.c
> > > > @@ -119,7 +119,9 @@ static int policy_to_aspm_state(struct pcie_link_state *link)
> > > >  		/* Enable Everything */
> > > >  		return ASPM_STATE_ALL;
> > > >  	case POLICY_DEFAULT:
> > > > -		return link->aspm_default;
> > > > +		return dev_is_removable(&link->downstream->dev) ?
> > > > +			link->aspm_capable :
> > > > +			link->aspm_default;
> > >
> > > I'm a little hesitant because dev_is_removable() is a convenient
> > > test that covers the current issue, but it doesn't seem tightly
> > > connected from a PCIe architecture perspective.
> > >
> > > I think the current model of compile-time ASPM policy selection:
> > >
> > >   CONFIG_PCIEASPM_DEFAULT          /* BIOS default setting */
> > >   CONFIG_PCIEASPM_PERFORMANCE      /* disable L0s and L1 */
> > >   CONFIG_PCIEASPM_POWERSAVE        /* enable L0s and L1 */
> > >   CONFIG_PCIEASPM_POWER_SUPERSAVE  /* enable L1 substates */
> > >
> > > is flawed.  As far as I know, there's no technical reason we
> > > have to select this at kernel build-time.  I suspect the
> > > original reason was risk avoidance, i.e., we were worried that
> > > we might expose hardware defects if we enabled ASPM states that
> > > BIOS hadn't already enabled.
> > >
> > > How do we get out of that model?
> > > We do have sysfs knobs that
> > > should cover all the functionality (set overall policy as above
> > > via /sys/module/pcie_aspm/parameters/policy; set device-level
> > > exceptions via /sys/bus/pci/devices/.../link/*_aspm).
> >
> > Agree.  Build-time policy can be obsoleted by a boot-time argument.
>
> I agree as well.  This isn't strictly relevant to the current
> problem, so let's put this on the back burner for now.

> > > In my opinion, the cleanest solution would be to enable all ASPM
> > > functionality whenever possible and let users disable it if they
> > > need to for performance.  If there are device defects when
> > > something is enabled, deal with it via quirks, as we do for
> > > other PCI features.
> > >
> > > That feels a little risky, but let's have a conversation about
> > > where we want to go in the long term.  It's good to avoid risk,
> > > but too much avoidance leads to its own complexity and an
> > > inability to change things.
> >
> > I think we should separate the situation into two cases:
> > - When BIOS/system firmware has the ability to program ASPM, honor
> >   it.  This applies to most "internal" PCI devices.
> > - When BIOS/system firmware can't program ASPM, enable ASPM for
> >   whatever it's capable of.  The most notable cases are the Intel
> >   VMD controller, and, with this patch, devices connected through TBT.
> >
> > Enabling all ASPM functionality regardless of what's been
> > pre-programmed by BIOS is way too risky.  Disabling ASPM to work
> > around issues and defects is still quite common among hardware
> > manufacturers.

It sounds like you have actual experience with this :)  Do you have
any concrete examples that we can use as "known breakage"?

This feels like a real problem to me.  There are existing mechanisms
(ACPI_FADT_NO_ASPM and _OSC PCIe cap ownership) the platform can use
to prevent the OS from using ASPM.  If vendors assume that *in
addition*, the OS will pay attention to whatever ASPM configuration
BIOS did, that's a major disconnect.
We don't do anything like that for other PCI features, and I'm not
aware of any restriction like that being documented.

> I think the pragmatic way to approach it is to (essentially) apply
> the policy as BIOS defaults and allow overrides from that.

Do you mean that when enumerating a device (at boot-time or hot-add
time), we would read the current ASPM config but not change it?  And
users could use the sysfs knobs to enable/disable ASPM as desired?
That wouldn't solve the problem Kai-Heng is trying to solve.

Or that we leave ASPM alone during boot-time enumeration, but enable
ASPM when we enumerate hot-added devices?  It doesn't sound right
that a device would be configured differently if present at boot vs
hot-added.

Bjorn
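
[Editor's note: the runtime knobs discussed in this thread are plain sysfs
files, so they can be inspected from a shell.  The sketch below is
illustrative only; `show_aspm_policy` is a made-up helper name, and the
optional sysfs-root argument exists purely so the function can be exercised
against a fake tree instead of real hardware.  The file paths themselves
(/sys/module/pcie_aspm/parameters/policy and
/sys/bus/pci/devices/.../link/*_aspm) are the ones named in the thread.]

```shell
# Print the global ASPM policy; the kernel marks the active policy
# with brackets, e.g. "default performance [powersave] powersupersave".
# Takes an optional sysfs root (default /sys) so it can be tested
# against a synthetic directory tree.
show_aspm_policy() {
    root="${1:-/sys}"
    policy_file="$root/module/pcie_aspm/parameters/policy"
    if [ -r "$policy_file" ]; then
        cat "$policy_file"
    else
        echo "pcie_aspm policy not exposed"
    fi
}
```

On a real system, writing a policy name back to the same file (e.g.
`echo powersave > /sys/module/pcie_aspm/parameters/policy`) changes the
global policy at runtime, and the per-device files such as
`/sys/bus/pci/devices/0000:07:04.0/link/l1_aspm` take boolean values to
set the device-level exceptions Bjorn mentions; exact file availability
depends on kernel config and link capabilities.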