On 4/18/2024 11:26 AM, Bjorn Helgaas wrote:
[+cc Keith]
On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote:
Adding documentation for the Intel VMD driver and updating the index
file to include it.
Signed-off-by: Paul M Stillwell Jr <paul.m.stillwell.jr@xxxxxxxxx>
---
Documentation/PCI/controller/vmd.rst | 51 ++++++++++++++++++++++++++++
Documentation/PCI/index.rst | 1 +
2 files changed, 52 insertions(+)
create mode 100644 Documentation/PCI/controller/vmd.rst
diff --git a/Documentation/PCI/controller/vmd.rst b/Documentation/PCI/controller/vmd.rst
new file mode 100644
index 000000000000..e1a019035245
--- /dev/null
+++ b/Documentation/PCI/controller/vmd.rst
@@ -0,0 +1,51 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=================================================================
+Linux Base Driver for the Intel(R) Volume Management Device (VMD)
+=================================================================
+
+Intel vmd Linux driver.
+
+Contents
+========
+
+- Overview
+- Features
+- Limitations
+
+The Intel VMD provides the means to provide volume management across separate
+PCI Express HBAs and SSDs without requiring operating system support or
+communication between drivers. It does this by obscuring each storage
+controller from the OS, but allowing a single driver to be loaded that would
+control each storage controller. A Volume Management Device (VMD) provides a
+single device for a single storage driver. The VMD resides in the IIO root
I'm not sure IIO (and PCH below) are really relevant to this.
I'm trying to describe where in the CPU architecture VMD exists because
it's not like other devices. It's not like a storage device or
networking device that is plugged in somewhere; it exists as part of the
CPU (in the IIO). I'm OK removing it, but it might be confusing to
someone looking at the documentation. I'm also close to this, so it may
be clear to me but confusing to others (which I know it is); any help
making it clearer would be appreciated.
I think we really just care about the PCI topology enumerable by the OS. If
they are relevant, expand them on first use as you did for VMD so we
have a hint about how to learn more about it.
I don't fully understand this comment. The PCI topology behind VMD is
not enumerable by the OS unless we are considering the vmd driver the
OS. If the user enables VMD in the BIOS and the vmd driver isn't loaded,
then the OS never sees the devices behind VMD.
The only reason the devices are seen by the OS is that the VMD driver
does some mapping when it loads during boot.
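To be concrete about what "some mapping" means: roughly speaking, the vmd
driver takes the VMD endpoint's BARs (which back the config space, MMIO
windows, and MSI-X vectors of the hidden devices), creates a brand new root
bus for them, and scans it itself. Something like this happens in
vmd_enable_domain() (simplified sketch from memory of
drivers/pci/controller/vmd.c, so treat the details as approximate):

  /*
   * The new domain is reachable only through the VMD endpoint's BARs,
   * so the PCI core never enumerates it; this driver has to do it.
   */
  vmd->bus = pci_create_root_bus(NULL, vmd->busn_start, &vmd_ops,
                                 sd, &resources);
  if (!vmd->bus)
          return -ENODEV;

  pci_scan_child_bus(vmd->bus);     /* enumerate what the BIOS hid */
  pci_bus_add_devices(vmd->bus);    /* bind drivers (e.g. nvme) to them */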
+complex and it appears to the OS as a root bus integrated endpoint. In the IIO,
I suspect "root bus integrated endpoint" means the same as "Root
Complex Integrated Endpoint" as defined by the PCIe spec? If so,
please use that term and capitalize it so there's no confusion.
OK, will fix.
+the VMD is in a central location to manipulate access to storage devices which
+may be attached directly to the IIO or indirectly through the PCH. Instead of
+allowing individual storage devices to be detected by the OS and allowing it to
+load a separate driver instance for each, the VMD provides configuration
+settings to allow specific devices and root ports on the root bus to be
+invisible to the OS.
How are these settings configured? BIOS setup menu?
I believe there are 2 ways this is done:
The first is that the system designer creates a design such that some
root ports and endpoints are behind VMD. If VMD is enabled in the BIOS
then these devices don't show up to the OS and require a driver to use
them (the vmd driver). If VMD is disabled in the BIOS then the devices
are seen by the OS at boot time.
The second way is that there are settings in the BIOS for VMD. I don't
think there are many settings... it's mostly enable/disable VMD.
+VMD works by creating separate PCI domains for each VMD device in the system.
+This makes VMD look more like a host bridge than an endpoint so VMD must try
+to adhere to the ACPI Operating System Capabilities (_OSC) flags of the system.
As Keith pointed out, I think this needs more details about how the
hardware itself works. I don't think there's enough information here
to maintain the OS/platform interface on an ongoing basis.
I think "creating a separate PCI domain" is a consequence of providing
a new config access mechanism, e.g., a new ECAM region, for devices
below the VMD bridge. That hardware mechanism is important to
understand because it means those downstream devices are unknown to
anything that doesn't grok the config access mechanism. For example,
firmware wouldn't know anything about them unless it had a VMD driver.
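To illustrate what I mean by "a new config access mechanism": the shape of it
is roughly the following (hand-wavy sketch, not the actual vmd code).

  /*
   * Config space for the new domain lives behind one of the VMD
   * endpoint's BARs, laid out ECAM-style, so only software that has
   * mapped that BAR (i.e. a VMD driver) can reach the devices below.
   */
  void __iomem *addr = vmd->cfgbar +
                       PCIE_ECAM_OFFSET(bus->number - vmd->busn_start,
                                        devfn, reg);
  *value = readl(addr);   /* firmware never sees these accesses */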
Some of the pieces that might help figure this out:
I'll add some details to answer these in the documentation, but I'll
give a brief answer here as well
- Which devices (VMD bridge, VMD Root Ports, devices below VMD Root
Ports) are enumerated in the host?
Only the VMD device (as a PCI endpoint) is seen by the OS without the
vmd driver.
- Which devices are passed through to a virtual guest and enumerated
there?
All devices under VMD are passed to a virtual guest
- Where does the vmd driver run (host or guest or both)?
I believe the answer is both.
- Who (host or guest) runs the _OSC for the new VMD domain?
I believe the answer here is neither :) This has been an issue since
commit 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features"). I've
submitted this patch
(https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@xxxxxxxxx/)
to attempt to fix the issue.
You are much more of an expert in this area than I am, but as far as I
can tell the only way the _OSC bits get cleared is via ACPI
(specifically this code
https://elixir.bootlin.com/linux/latest/source/drivers/acpi/pci_root.c#L1038).
Since ACPI doesn't run on the devices behind VMD the _OSC bits don't get
set properly for them.
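For reference, what that commit does is roughly the following when the VMD
root bus is brought up: it copies the _OSC results that ACPI negotiated for
the root bridge VMD sits under onto the synthetic VMD host bridge
(paraphrased from memory, so the exact field list may be slightly off):

  static void vmd_copy_host_bridge_flags(struct pci_host_bridge *root_bridge,
                                         struct pci_host_bridge *vmd_bridge)
  {
          /* ACPI never runs _OSC for the VMD domain, so inherit the parent's results */
          vmd_bridge->native_pcie_hotplug = root_bridge->native_pcie_hotplug;
          vmd_bridge->native_shpc_hotplug = root_bridge->native_shpc_hotplug;
          vmd_bridge->native_aer = root_bridge->native_aer;
          vmd_bridge->native_pme = root_bridge->native_pme;
          vmd_bridge->native_ltr = root_bridge->native_ltr;
          vmd_bridge->native_dpc = root_bridge->native_dpc;
  }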
Ultimately the only _OSC bits that VMD cares about are the hotplug bits
because that is a feature of our device; it enables hotplug in guests
where there is no way to enable it. That's why my patch is to set them
all the time and copy the other _OSC bits because there is no other way
to enable this feature (i.e. there is no user space tool to
enable/disable it).
- What happens to interrupts generated by devices downstream from
VMD, e.g., AER interrupts from VMD Root Ports, hotplug interrupts
from VMD Root Ports or switch downstream ports? Who fields them?
In general firmware would field them unless it grants ownership
via _OSC. If firmware grants ownership (or the OS forcibly takes
it by overriding it for hotplug), I guess the OS that requested
ownership would field them?
The interrupts are passed through VMD to the OS. This was the AER issue
that resulted in commit 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe
features"). IIRC AER was disabled in the BIOS, but is was enabled in the
VMD host bridge because pci_init_host_bridge() sets all the bits to 1
and that generated an AER interrupt storm.
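(For context: pci_init_host_bridge() starts every host bridge out assuming
native/OS control, roughly like below, and normally the ACPI _OSC negotiation
then takes some of that away; the VMD host bridge never got that second step.)

  /* defaults in drivers/pci/probe.c, approximately */
  bridge->native_pcie_hotplug = 1;
  bridge->native_shpc_hotplug = 1;
  bridge->native_aer = 1;
  bridge->native_pme = 1;
  bridge->native_ltr = 1;
  bridge->native_dpc = 1;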
In bare metal scenarios the _OSC bits are correct, but in a hypervisor
scenario the bits are wrong because they are all 0 regardless of what
the ACPI tables indicate. The challenge is that the VMD driver has no
way to know it's in a hypervisor to set the hotplug bits correctly.
- How do interrupts (hotplug, AER, etc) for things below VMD work?
Assuming the OS owns the feature, how does the OS discover them?
I feel like this is the same question as above? Or maybe I'm missing a
subtlety about this...
I guess probably the usual PCIe Capability and MSI/MSI-X
Capabilities? Which OS (host or guest) fields them?
+A couple of the _OSC flags regard hotplug support. Hotplug is a feature that
+is always enabled when using VMD regardless of the _OSC flags.
We log the _OSC negotiation in dmesg, so if we ignore or override _OSC
for hotplug, maybe that should be made explicit in the logging
somehow?
That's a really good idea and something I can add to
https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@xxxxxxxxx/
Would a message like this help from the VMD driver?
"VMD enabled, hotplug enabled by VMD"
+Features
+========
+
+- Virtualization
+- MSIX interrupts
+- Power Management
+- Hotplug
s/MSIX/MSI-X/ to match spec usage.
I'm not sure what this list is telling us.
Will fix
+Limitations
+===========
+
+When VMD is enabled and used in a hypervisor the _OSC flags provided by the
+hypervisor BIOS may not be correct. The most critical of these flags are the
+hotplug bits. If these bits are incorrect then the storage devices behind the
+VMD will not be able to be hotplugged. The driver always supports hotplug for
+the devices behind it so the hotplug bits reported by the OS are not used.
"_OSC may not be correct" sounds kind of problematic. How does the
OS deal with this? How does the OS know whether to pay attention to
_OSC or ignore it because it tells us garbage?
That's the $64K question, lol. We've been trying to solve that since
commit 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features") :)
If we ignore _OSC hotplug bits because "we know what we want, and we
know we won't conflict with firmware," how do we deal with other _OSC
bits? AER? PME? What about bits that may be added in the future?
Is there some kind of roadmap to help answer these questions?
As I mentioned earlier, VMD only really cares about hotplug because that
is the feature we enable for guests (and hosts).
I believe the solution is to use the root bridge settings for all the other
bits (which is what happens currently). In practice that means the bits will
be correct for all the features (AER et al.) in a bare metal scenario, and in
a guest scenario all the bits other than hotplug (which we will always
enable) will be 0; that's what we see in every hypervisor scenario we've
tested, and it's fine for us because we don't care about any of the other
bits.
That's why I think it's ok for us to set the hotplug bits to 1 when the
VMD driver loads; we aren't harming any other devices, we are enabling a
feature that we know our users want, and we are setting all the other
_OSC bits "correctly" (for some values of correctly :))
I appreciate your feedback and I'll start working on updating the
documentation to make it clearer. I'll wait to send a v2 until I feel
like we've finished our discussion from this one.
Paul
Bjorn