On 6/19/2023 4:37 PM, Bjorn Helgaas wrote:
On Mon, Jun 19, 2023 at 11:16:35AM -0500, Limonciello, Mario wrote:
On 6/15/2023 10:01 PM, Kai-Heng Feng wrote:
On Fri, Jun 16, 2023 at 1:12 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
On Thu, Jun 15, 2023 at 03:04:20PM +0800, Kai-Heng Feng wrote:
When a PCIe device is hotplugged to a Thunderbolt port, ASPM is not
enabled for that device. However, when the device is plugged preboot,
ASPM is enabled by default.
The disparity happens because BIOS doesn't have the ability to program
ASPM on hotplugged devices.
So enable ASPM by default for external connected PCIe devices so ASPM
settings are consitent between preboot and hotplugged.
On HP Thunderbolt Dock G4, enable ASPM can also fix BadDLLP error:
pcieport 0000:00:1d.0: AER: Corrected error received: 0000:07:04.0
pcieport 0000:07:04.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
pcieport 0000:07:04.0: device [8086:0b26] error status/mask=00000080/00002000
pcieport 0000:07:04.0: [ 7] BadDLLP
The root cause is still unclear, but quite likely because the I225 on
the dock supports PTM, where ASPM timing is precalculated for the PTM.
Cc: Mario Limonciello <mario.limonciello@xxxxxxx>
Cc: Mika Westerberg <mika.westerberg@xxxxxxxxxxxxxxx>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217557
Signed-off-by: Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx>
---
drivers/pci/pcie/aspm.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
index 66d7514ca111..613b0754c9bb 100644
--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -119,7 +119,9 @@ static int policy_to_aspm_state(struct pcie_link_state *link)
/* Enable Everything */
return ASPM_STATE_ALL;
case POLICY_DEFAULT:
- return link->aspm_default;
+ return dev_is_removable(&link->downstream->dev) ?
+ link->aspm_capable :
+ link->aspm_default;
I'm a little hesitant because dev_is_removable() is a convenient
test that covers the current issue, but it doesn't seem tightly
connected from a PCIe architecture perspective.
I think the current model of compile-time ASPM policy selection:
CONFIG_PCIEASPM_DEFAULT /* BIOS default setting */
CONFIG_PCIEASPM_PERFORMANCE /* disable L0s and L1 */
CONFIG_PCIEASPM_POWERSAVE /* enable L0s and L1 */
CONFIG_PCIEASPM_POWER_SUPERSAVE /* enable L1 substates */
is flawed. As far as I know, there's no technical reason we
have to select this at kernel build-time. I suspect the
original reason was risk avoidance, i.e., we were worried that
we might expose hardware defects if we enabled ASPM states that
BIOS hadn't already enabled.
How do we get out of that model? We do have sysfs knobs that
should cover all the functionality (set overall policy as above
via /sys/module/pcie_aspm/parameters/policy; set device-level
exceptions via /sys/bus/pci/devices/.../link/*_aspm).
Agree. Build-time policy can be obsoleted by boot-time argument.
I agree as well.
This isn't strictly relevant to the current problem, so let's put this
on the back burner for now.
I think it could fold into it if we end up treating the i225
PCIe device as a quirk as mentioned below.
In my opinion, the cleanest solution would be to enable all ASPM
functionality whenever possible and let users disable it if they
need to for performance. If there are device defects when
something is enabled, deal with it via quirks, as we do for
other PCI features.
That feels a little risky, but let's have a conversation about
where we want to go in the long term. It's good to avoid risk,
but too much avoidance leads to its own complexity and an
inability to change things.
I think we should separate the situation into two cases:
- When BIOS/system firmware has the ability to program ASPM, honor
it. This applies to most "internal" PCI devices.
- When BIOS/system can't program ASPM, enable ASPM for whatever
it's capable of. Most notable case is Intel VMD controller, and
this patch for devices connected through TBT.
Enabling all ASPM functionality regardless of what's being
pre-programmed by BIOS is way too risky. Disabling ASPM to
workaround issues and defects are still quite common among
hardware manufacturers.
It sounds like you have actual experience with this :) Do you have
any concrete examples that we can use as "known breakage"?
A variety of Intel chipsets don't support lane width switching
or speed switching. When ASPM has been enabled on a dGPU,
these features are utilized and breakage ensues.
There are various methods to try to mitigate the impact both in
firmware and driver code.
This feels like a real problem to me. There are existing mechanisms
(ACPI_FADT_NO_ASPM and _OSC PCIe cap ownership) the platform can use
to prevent the OS from using ASPM.
If vendors assume that *in addition*, the OS will pay attention to
whatever ASPM configuration BIOS did, that's a major disconnect. We
don't do anything like that for other PCI features, and I'm not aware
of any restriction like that being documented.
With both of those policies in place, how did we get into
the situation of having configuration options and knobs?
I think the pragmatic way to approach it is to (essentially) apply
the policy as BIOS defaults and allow overrides from that.
Do you mean that when enumerating a device (at boot-time or hot-add
time), we would read the current ASPM config but not change it? And
users could use the sysfs knobs to enable/disable ASPM as desired?
Yes.
That wouldn't solve the problem Kai-Heng is trying to solve.
Alone it wouldn't; but if you treated the i225 PCIe device
connected to the system as a "quirk" to apply ASPM policy
from the parent device to this child device it could.
Or that we leave ASPM alone during boot-time enumeration, but enable
ASPM when we enumerate hot-added devices? It doesn't sound right that
a device would be configured differently if present at boot vs
hot-added.
Same policy for both boot and hot add but specifically if the
device is in a quirk list to enable it or disable it.