Hello, On Mon, Dec 20, 2021 at 3:06 PM David E. Box <david.e.box@xxxxxxxxxxxxxxx> wrote: > > Hi Born, > > So you know this patch comes from our platform power management team to address several VMD bugs > affecting power. The VMD driver developers are already cc'd but also adding Francisco from the VMD > team who can maybe help clarify. > > On Mon, 2021-12-20 at 11:28 -0600, Bjorn Helgaas wrote: > > On Thu, Dec 16, 2021 at 01:24:00PM -0800, David E. Box wrote: > > > On Thu, 2021-12-16 at 11:26 -0600, Bjorn Helgaas wrote: > > > > On Wed, Dec 15, 2021 at 09:56:00PM -0800, David E. Box wrote: > > > > > From: Michael Bottini <michael.a.bottini@xxxxxxxxxxxxxxx> > > > > > > > > > > On Tiger Lake and Alder Lake platforms, VMD controllers do not have ASPM > > > > > enabled nor LTR values set by BIOS. This leads high power consumption on > > > > > these platforms when VMD is enabled as reported in bugzilla [1]. Enable > > > > > these features in the VMD driver using pcie_aspm_policy_override() to set > > > > > the ASPM policy for the root ports. > > > > > ... > > > > > > > To do this, add an additional flag in VMD features to specify > > > > > devices that must have their respective policies overridden. > > > > > > > > I'm not clear on why you want this to apply to only certain VMDs > > > > and not others. Do some BIOSes configure ASPM for devices below > > > > some VMDs? > > > > > > Not currently. But the plan is for future devices to move back to > > > having BIOS do the programming. > > > > Since this is apparently a BIOS design choice, it seems wrong to base > > the functionality on the Device ID instead of some signal that tells > > us what the BIOS is doing. > > I don't know if there is another way to tell what the BIOS is _not_ doing. DID is tied to the SoC so > it's our best method of inferring this. We know the firmware team has said they will not support > programming these values for the mentioned platforms so there's no BIOS update coming to change > this. I'd like to mention here that ChromeOS also runs linux kernel on TGL and ADL platforms, and we may run a different BIOS altogether, so yes, we'd greatly appreciate if you do not assume BIOS setting from device ID / SOC ID, if the setting is *not tied to device ID / SOC ID*. Perhaps you use ACPI or device trees to convey what BIOS has or hasn't configured? > But to be sure, we can add a check first to confirm that these values have not been programmed > (ASPM is disabled and LTRs are 0). Yes, reading the ASPM configuration from hardware sounds OK to me. > > > > > > > > + * Override the BIOS ASPM policy and set the LTR value for PCI storage > > > > > + * devices on the VMD bride. > > > > > > > > I don't think there's any BIOS "policy" here. At this point BIOS > > > > is no longer involved at all, so all that's left is whatever ASPM > > > > config the BIOS did or did not do. Should the Policy override really take place if aspm_policy = POLICY_DEFAULT which seems to suggest we leave it to whatever BIOS left it at? Perhaps for this override to take place, aspm_policy needs to be the policy needs to be POLICY_POWERSAVE? #define POLICY_DEFAULT 0 /* BIOS default setting */ #define POLICY_PERFORMANCE 1 /* high performance */ #define POLICY_POWERSAVE 2 /* high power saving */ #define POLICY_POWER_SUPERSAVE 3 /* possibly even more power saving */ > > > > > > > > Why only storage? > > > > > > Only storage devices will be on these root ports. > > > > How do you know this? You say below that there's an M.2 slot, so > > surely the slot could contain a non-storage device? Couldn't somebody > > build a platform with a VMD root port connected to a regular PCIe x4 > > slot? > > In VMD mode the root ports under the VMD bridge are used specifically to manage storage devices. So > only storage devices should be attached to these ports. Bjorn: from https://www.intel.com/content/www/us/en/architecture-and-technology/intel-volume-management-device-overview.html "Intel® Volume Management Device (Intel® VMD) is specifically designed for enterprise-grade management of NVMe SSDs connected to Intel® Xeon® CPUs." I think what David means is that any other device other than an NVME SSD showing up under VMD might be an error / bad device. > > > Couldn't such a slot support hotplug? > > The answer I just learned is yes. It may be possible to have a switch too. So both of those would > need to be addressed though I'm note quite sure how yet. > > > > > It would be very unusual to hard-code topology knowledge like this > > into the kernel, since plug-and-play has always been a major goal of > > PCI. > > > > > > > +static int vmd_enable_aspm(struct pci_dev *pdev, void *userdata) > > > > > +{ > > > > > + int features = *(int *)userdata, pos; > > > > > + > > > > > + if (!(features & VMD_FEAT_QUIRK_OVERRIDE_ASPM) || > > > > > + pdev->class != PCI_CLASS_STORAGE_EXPRESS) > > > > > + return 0; > > > > > > > + pci_write_config_word(pdev, pos + PCI_LTR_MAX_SNOOP_LAT, 0x1003); > > > > > + pci_write_config_word(pdev, pos + PCI_LTR_MAX_NOSNOOP_LAT, 0x1003); > > > > > > > > 1) Where did this magic 0x1003 value come from? Does that depend > > > > on the VMD device? The endpoint? The circuit design? The path > > > > between endpoint and VMD? What if there are switches in the path? > > > > > > The number comes from the BIOS team. They are tied to the SoC. I > > > don't believe there can be switches in the path but Nirmal and > > > Jonathan should know for sure. From what I've seen these root ports > > > are wired directly to M.2 slots on boards that are intended for > > > storage devices. > > > > I guess you're saying that 0x1003 is determined by the SoC. If so, I > > think this value should be in your .driver_data (which would mean > > converting it from a scalar to a struct, as many other drivers do). > > The current code suggests that 0x1003 is the correct value for *all* > > VMDs and all configurations. > > Okay > > > > > I don't understand LTR well enough to be convinced that this static > > value would be the correct value for all possible hierarchies and > > devices that could appear below a VMD root port. I would love to be > > educated about this. > > I am not an LTR expert either, but LTRs are "conglomerated" per spec. Switches subtract their own > latency from the minimum LTR of the downstream ports and report this conglomerated value upstream. > So the value will decrease as needed to account for latency in the hierarchy. > > PCI 5.0 Version 1.0, Section 6.18 (pg 611) As described here and as much as I understand, I think these LTR values (0x1003 here) depend on the platform topology. So what works for David's platform may not work for another one that may have Switched for e.g., or for perhaps for another NVME that is hotplugged later (and has different characteristics than what is on board). Technically tying the LTR values to a SOC or VID/DID sounds incorrect to me, and instead should be tied to the platform. Ideally I *thiink* they should come from ACPI or device tree. However, I wouldn't protest putting them in driver_data for now given that there aren't any other users. We may have to find a solution once there are more platforms that begin to use this driver. I think this is a bigger problem that needs to be addressed w.r.t. ASPM in general. Thanks & Best Regards, Rajat > > Thanks > > David > > > > > Bjorn > >