On 4/22/2024 3:52 PM, Bjorn Helgaas wrote:
On Mon, Apr 22, 2024 at 02:39:16PM -0700, Paul M Stillwell Jr wrote:
On 4/22/2024 1:27 PM, Bjorn Helgaas wrote:
On Fri, Apr 19, 2024 at 03:18:19PM -0700, Paul M Stillwell Jr wrote:
On 4/19/2024 2:14 PM, Bjorn Helgaas wrote:
On Thu, Apr 18, 2024 at 02:51:19PM -0700, Paul M Stillwell Jr wrote:
On 4/18/2024 11:26 AM, Bjorn Helgaas wrote:
On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote:
Adding documentation for the Intel VMD driver and updating the index
file to include it.
- Which devices are passed through to a virtual guest and enumerated
there?
All devices under VMD are passed to a virtual guest
So the guest will see the VMD Root Ports, but not the VMD RCiEP
itself?
My understanding is that the guest will see the VMD device, and the vmd
driver in the guest will then enumerate the devices behind it.
- Where does the vmd driver run (host or guest or both)?
I believe the answer is both.
If the VMD RCiEP isn't passed through to the guest, how can the vmd
driver do anything in the guest?
The VMD device is passed through to the guest. It works just like bare metal
in that the guest OS detects the VMD device and loads the vmd driver, which
then enumerates the devices behind it in the guest.
I guess it's obvious that the VMD RCiEP must be passed through to the
guest because the whole point of
https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@xxxxxxxxx/
is to do something in the guest.
It does puzzle me that we have two copies of the vmd driver (one in
the host OS and another in the guest OS) that think they own the same
physical device. I'm not a virtualization guru but that sounds
potentially problematic.
IIUC, the current situation is "regardless of what firmware said, in
the VMD domain we want AER disabled and hotplug enabled."
We aren't saying we want AER disabled; we are just saying we want hotplug
enabled. The observation is that in a hypervisor scenario AER is going to be
disabled because the _OSC bits are all 0.
04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features") is saying
we want AER disabled for the VMD domain, isn't it?
I don't see it that way, but maybe I'm misunderstanding something. Here is
the code from that commit (only the portion of interest):
+/*
+ * Since VMD is an aperture to regular PCIe root ports, only allow it to
+ * control features that the OS is allowed to control on the physical PCI bus.
+ */
+static void vmd_copy_host_bridge_flags(struct pci_host_bridge *root_bridge,
+				       struct pci_host_bridge *vmd_bridge)
+{
+	vmd_bridge->native_pcie_hotplug = root_bridge->native_pcie_hotplug;
+	vmd_bridge->native_shpc_hotplug = root_bridge->native_shpc_hotplug;
+	vmd_bridge->native_aer = root_bridge->native_aer;
+	vmd_bridge->native_pme = root_bridge->native_pme;
+	vmd_bridge->native_ltr = root_bridge->native_ltr;
+	vmd_bridge->native_dpc = root_bridge->native_dpc;
+}
+
When I look at this, we are copying whatever is in the root_bridge to the
vmd_bridge.
Right. We're copying the results of the _OSC that applies to the VMD
RCiEP (not an _OSC that applies to the VMD domain).
In a bare metal scenario, this is correct and AER will be whatever
the BIOS set up (IIRC on my bare metal system AER is enabled).
I think the first dmesg log at
https://bugzilla.kernel.org/show_bug.cgi?id=215027 is from a host OS:
[ 0.000000] DMI: Dell Inc. Dell G15 Special Edition 5521/, BIOS 0.4.3 10/20/2021
[ 0.408990] ACPI: PCI Root Bridge [PC00] (domain 0000 [bus 00-e0])
[ 0.410076] acpi PNP0A08:00: _OSC: platform does not support [AER]
[ 0.412207] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug SHPCHotplug PME PCIeCapability LTR]
[ 1.367220] vmd 0000:00:0e.0: PCI host bridge to bus 10000:e0
[ 1.486704] pcieport 10000:e0:06.0: AER: enabled with IRQ 146
We evaluated _OSC for domain 0000 with OSC_QUERY_ENABLE ("Query
Support Flag" in the specs) and learned the platform doesn't support
AER, so we removed that from the features we request. We don't
request control of AER in domain 0000, so native_aer will be 0. AER
might be enabled by firmware, but the host Linux should not enable it
in domain 0000.
In domain 10000, the host Linux *does* enable AER because all new host
bridges (including the new VMD domain 10000) assume native control to
start with, and we update that based on _OSC results. There's no ACPI
device describing the VMD host bridge, so there's no _OSC that applies
to it, so Linux assumes it owns everything.
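For reference, those defaults come from pci_init_host_bridge() in
drivers/pci/probe.c; the relevant lines look roughly like this (paraphrased,
not an exact quote):

	/* Every new host bridge starts with the OS owning all PCIe services */
	bridge->native_pcie_hotplug = 1;
	bridge->native_shpc_hotplug = 1;
	bridge->native_aer = 1;
	bridge->native_pme = 1;
	bridge->native_ltr = 1;
	bridge->native_dpc = 1;

A bridge that never sees an _OSC, like the one vmd creates, keeps all of
these set.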
In a hypervisor scenario the root_bridge bits will all be 0 which
effectively disables AER and all the similar bits.
I don't think https://bugzilla.kernel.org/show_bug.cgi?id=215027
includes a dmesg log from a hypervisor guest, so I don't know what
happens there.
The host_bridge->native_* bits always start as 1 (OS owns the feature)
and they only get cleared if there's an _OSC that retains firmware
ownership.
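The clearing is done in acpi_pci_root_create() in drivers/acpi/pci_root.c,
roughly like this (paraphrased, not an exact quote):

	/* Clear a native_* bit only when _OSC did not grant the OS control */
	if (!(root->osc_control_set & OSC_PCI_EXPRESS_NATIVE_HP_CONTROL))
		host_bridge->native_pcie_hotplug = 0;
	if (!(root->osc_control_set & OSC_PCI_EXPRESS_AER_CONTROL))
		host_bridge->native_aer = 0;
	if (!(root->osc_control_set & OSC_PCI_EXPRESS_PME_CONTROL))
		host_bridge->native_pme = 0;

With no ACPI device and no _OSC for the VMD domain, that code never runs for
the VMD host bridge.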
Prior to this commit all the native_* bits were set to 1 because
pci_init_host_bridge() sets them all to 1, so we were enabling AER et al.
regardless of what the _OSC results said. With the commit we are setting the
bits to whatever the BIOS has set them to.
Prior to 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features"),
the host OS owned all features in the VMD domain. After 04b12ef163d1,
it depended on whatever the _OSC for the VMD RCiEP said.
On the Dell mentioned in that bugzilla, the domain 0000 _OSC said AER
wasn't supported, and 04b12ef163d1 applied that to the VMD domain as
well, so the host OS didn't enable AER in either domain.
But other machines would have different answers. On a machine that
supports AER and grants ownership to the OS, 04b12ef163d1 would enable
AER in both domain 0000 and the VMD domain.
I don't think we know the root cause of the Correctable Errors from
that bugzilla. But it's likely that they would still occur on a
machine that granted AER to the OS, and if we enable AER in the VMD
domain, we would probably still have the flood.
That is why I think 04b12ef163d1 is problematic: it only disables
AER in the VMD domain when AER happens to be disabled in domain 0000.
That's a great explanation and very helpful to me, thanks for that!
It seems like the only clear option is to say "the vmd driver owns all
PCIe services in the VMD domain, the platform does not supply _OSC for
the VMD domain, the platform can't do anything with PCIe services in
the VMD domain, and the vmd driver needs to explicitly enable/disable
services as it needs."
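As a purely hypothetical sketch (not an actual patch; the function name and
the non-hotplug choices are invented for illustration), that could mean
replacing vmd_copy_host_bridge_flags() with an explicit policy:

	/*
	 * Hypothetical: no _OSC applies to the VMD domain, so state the vmd
	 * driver's own policy instead of copying domain 0000's _OSC results.
	 */
	static void vmd_set_host_bridge_flags(struct pci_host_bridge *vmd_bridge)
	{
		vmd_bridge->native_pcie_hotplug = 1;	/* vmd handles hotplug */
		vmd_bridge->native_shpc_hotplug = 0;
		vmd_bridge->native_aer = 0;		/* policy still to be decided */
		vmd_bridge->native_pme = 0;
		vmd_bridge->native_ltr = 0;
		vmd_bridge->native_dpc = 0;
	}

with the open question being which of the non-hotplug services the driver
should actually claim.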
I actually looked at this as well :) I had an idea to set the _OSC bits to 0
when the vmd driver created the domain, then look at all the root ports
underneath it and see if AER and PM were set. If any root port underneath
VMD had AER or PM set, then I would set the corresponding _OSC bit for the
bridge to 1. That way if any root port underneath VMD had enabled AER (as an
example) then that feature would still work. I didn't test this in a
hypervisor scenario though, so I'm not sure what I would see.
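A rough sketch of that idea (hypothetical and untested; the helper names are
invented for illustration, and it would have to run after the VMD child bus
has been scanned) might look like:

	/*
	 * Hypothetical: claim native AER for the VMD domain only if firmware
	 * left error reporting enabled on at least one Root Port below VMD.
	 */
	static int vmd_port_has_aer(struct pci_dev *dev, void *data)
	{
		bool *found = data;
		u16 devctl;

		if (pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT || !dev->aer_cap)
			return 0;

		pcie_capability_read_word(dev, PCI_EXP_DEVCTL, &devctl);
		if (devctl & (PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE |
			      PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE))
			*found = true;

		return 0;
	}

	static void vmd_set_native_aer(struct vmd_dev *vmd,
				       struct pci_host_bridge *vmd_bridge)
	{
		bool aer_enabled = false;

		pci_walk_bus(vmd->bus, vmd_port_has_aer, &aer_enabled);
		vmd_bridge->native_aer = aer_enabled;
	}

and similarly for PME, but as noted it's not clear what the Root Ports would
report in a hypervisor scenario.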
_OSC negotiates ownership of features between platform firmware and
OSPM. The "native_pcie_hotplug" and similar bits mean that "IF a
device advertises the feature, the OS can use it." We clear those
native_* bits if the platform retains ownership via _OSC.
If BIOS doesn't enable the VMD host bridge and doesn't supply _OSC for
the domain below it, why would we assume that BIOS retains ownership
of the features negotiated by _OSC? I think we have to assume the OS
owns them, which is what happened before 04b12ef163d1.
Sorry, this confuses me :) If BIOS doesn't enable VMD (i.e. VMD is disabled)
then all the root ports and devices underneath VMD are visible to the BIOS
and OS so ACPI would run on all of them and the _OSC bits should be set
correctly.
Sorry, that was confusing. I think there are two pieces to enabling
VMD:
1) There's the BIOS "VMD enable" switch. If set, the VMD device
appears as an RCiEP and the devices behind it are invisible to the
BIOS. If cleared, VMD doesn't exist; the VMD RCiEP is hidden and
the devices behind it appear as normal Root Ports with devices below
them.
2) When the BIOS "VMD enable" is set, the OS vmd driver configures
the VMD RCiEP and enumerates things below the VMD host bridge.
In this case, BIOS enables the VMD RCiEP, but it doesn't have a
driver for it and it doesn't know how to enumerate the VMD Root
Ports, so I don't think it makes sense for BIOS to own features for
devices it doesn't know about.
That makes sense to me. It sounds like VMD should own all the features;
I just don't know how the vmd driver would set the bits other than
hotplug correctly... We know leaving them on is problematic, but I'm not
sure what method to use to decide which of the other bits should be set
or not.
These other features are in a no man's land. I did consider running ACPI
for the devices behind VMD, but the ACPI functions are almost all static
to acpi.c (IIRC) and I didn't feel like I should rototill the ACPI code
for this...
I'm hesitant to set all the other feature bits to 0 as well because we
may have customers who want AER on their devices behind VMD. I'll do
some research on my side to see how we enable/handle other features.
If it turns out that we never use AER or any of the other features,
would it be OK if we set everything other than hotplug to 0 and print
some messages, similar to what the OS prints, stating that VMD owns all
these features and isn't supporting them?
Paul