On Wed, Aug 23, 2023 at 06:46:19AM -0500, Bjorn Helgaas wrote: > On Wed, Aug 23, 2023 at 07:04:53AM +0200, Lukas Wunner wrote: > > On Tue, Aug 22, 2023 at 07:02:43PM -0500, Bjorn Helgaas wrote: > > > On Tue, Aug 22, 2023 at 12:11:10PM +0200, Rafael J. Wysocki wrote: > > > > What we need to deal with here is basically non-compliant systems and > > > > so we have to catch the various forms of non-compliance. > > > > > > Thanks for this, that helps. If pci_bridge_d3_possible() is a list of > > > quirks for systems that are known to be broken (or at least not known > > > to work correctly and avoiding D3 is acceptable), then we should > > > document and use it that way. > > > > > > The current documentation ("checks if it is possible to move to D3") > > > frames it as "does the bridge have the required features?" instead of > > > "do we know about something broken in this bridge or this platform?" > > > > > > If something is broken, I would expect tests based on the device or > > > DMI check. But several some are not obvious defects. E.g., > > > "bridge->is_hotplug_bridge && !pciehp_is_native(bridge)" -- what > > > defect are we finding there? What does the spec require that isn't > > > happening? > > > > This particular check doesn't pertain to a defect, but indeed > > follows from the spec: > > > > If hotplug control wasn't granted to the OS, the OS shall not put > > the hotplug port in D3 behind firmware's back because the power state > > affects accessibility of devices downstream of the hotplug port. > > > > Put another way, the firmware expects to have control of hotplug > > and hotplug may break if the OS fiddles with the power state of the > > hotplug port. > > > > Here's a bugzilla where this caused issues: > > https://bugzilla.kernel.org/show_bug.cgi?id=53811 > > > > On the other hand Thunderbolt hotplug ports are required to runtime > > suspend to D3 in order to save power. > > Sounds like there may be a requirement in a Thunderbolt spec about > this, so maybe we could add that citation? I guess this goes with the > "bridge->is_thunderbolt" check? Right, that's the check I was referring to. But I'm afraid there is no explicit rule in Thunderbolt / USB4 specs that hotplug ports must be runtime suspended in order to save power, at least to the best of my knowledge. In practice, Thunderbolt controllers come in one of two forms, discrete or integrated into the CPU. Originally only discrete controllers existed, but over time it became more and more common to integrate them. Apple was pretty much the only vendor which sold larger quantities of Thunderbolt 1 and 2 chips in the 2010 to 2016 era. Back in the day, only discrete controllers existed and they consumed around 1.5 to 2 W. On laptops, that's a significant amount of energy, so from day 1, Apple put load switches on their motherboards which allowed the Thunderbolt controllers to be powered down if nothing is plugged in. With the *integrated* Thunderbolt controllers, powering down on idle is usually likewise required to allow the entire CPU package to enter a low power state. To the operating system, a Thunderbolt controller is visible as a PCIe switch with an NHI device below one of the Downstream Ports and hotplugged devices appearing below the other Downstream Ports. The NHI is a vendor-agnostic Native Host Interface for tunnel setup, similar to the OHCI, EHCI, XHCI interface definitions that are in use with USB and FireWire controllers. Linux uses a hierarchical power management model, i.e. parent devices cannot runtime suspend unless their children runtime suspend. Thus, the hotplug ports need to runtime suspend and then the Switch Upstream Port can runtime suspend and that triggers powerdown of the controller. To cut a long story short, in *practice* Thunderbolt hotplug ports need to be put into D3hot in order to save power, so that's what we do. And we know from experience that they're all *safe* to be put into D3hot. Hence we're whitelisting them in pci_bridge_d3_possible(). By contrast, I recall that we got MCEs on Xeon-SP processors back in the day when their Root Ports were put into D3hot. Hence the rather conservative approach taken in pci_bridge_d3_possible() to whitelist only known-good, newer hardware. > > On Macs they're always handled > > natively by the OS. Hence the code comment. > > And I guess this goes with the "System Management Mode" and > "Thunderbolt on non-Macs" comments? A citation to the source behind > "OS shall not put the hotplug port in D3 behind firmware's back" would > be super helpful here. So I've just looked through the PCI Firmware Spec and can't find that mentioned anywhere explicitly, but it's pretty obvious if you think about it: If the OS puts the hotplug port in D3hot, its downstream bus transitions to either B2 or B3 (PCI Power Management Spec r1.2 sec 4.7.1). Which means devices downstream of the hotplug port become inaccessible. If hotplug control wasn't granted to the OS, firmware expects it may handle the hotplug port without interference by the OS. If the OS fiddles with the hotplug port's power state, that expectation is no longer met. Let's say the firmware reads the downstream device's Vendor ID register to probe whether the device is there. That'll no longer work as expected once the hotplug port is transitioned to D3hot by the operating system. In retrospect, this code comment is probably confusing: /* * Hotplug ports handled by firmware in System Management Mode * may not be put into D3 by the OS (Thunderbolt on non-Macs). */ if (bridge->is_hotplug_bridge && !pciehp_is_native(bridge)) return false; This is not in any way specific to Thunderbolt. It's just that back in the day, Thunderbolt tunnel management was done natively on Macs, whereas non-Macs did it in firmware. That has since changed and most vendors have adopted native tunnel management. So the code comment is outdated. The following would probably be more accurate today: /* * Hotplug ports handled by firmware in System Management Mode * may not be put into D3 by the OS (behind firmware's back). */ > > A somewhat longer explanation I gave in 2016: > > https://lore.kernel.org/all/20160617213209.GA1927@xxxxxxxxx/ > > > > Perhaps the code comment preceding that check can be rephrased to > > convey its meaning more clearly... > > Thanks! I think it would be worth trying to separate out the "normal" > things that correspond to the spec from the "quirk" things that work > around defects. That's not material for *this* patch, though. > > It's also a little weird that pci_bridge_d3_possible() itself looks > like it's invariant for the life of the system, but we call it several > times (pci_pm_init(), pci_bridge_d3_update(), pcie_portdrv_probe(), > etc). I guess this is because we save the result in dev->bridge_d3, > but then pci_bridge_d3_update() updates dev->bridge_d3 based on other > things, so the original value is lost. Maybe another bit or two could > avoid those extra calls. Right on all accounts. Those invocations of pci_bridge_d3_possible() are all in code paths which run only once, e.g. on enumeration and removal of the PCIe port and on shutdown. We figured that it's not worth it to cache the return value of pci_bridge_d3_possible() for these few invocations, none of which are in hot paths. ("We" is mostly Mika and yours truly, who introduced this for Thunderbolt power management.) Thanks, Lukas