On 4/18/2024 11:26 AM, Bjorn Helgaas wrote:
[+cc Keith]
On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote:
Adding documentation for the Intel VMD driver and updating the index
file to include it.
Signed-off-by: Paul M Stillwell Jr <paul.m.stillwell.jr@xxxxxxxxx>
---
Documentation/PCI/controller/vmd.rst | 51 ++++++++++++++++++++++++++++
Documentation/PCI/index.rst | 1 +
2 files changed, 52 insertions(+)
create mode 100644 Documentation/PCI/controller/vmd.rst
diff --git a/Documentation/PCI/controller/vmd.rst b/Documentation/PCI/controller/vmd.rst
new file mode 100644
index 000000000000..e1a019035245
--- /dev/null
+++ b/Documentation/PCI/controller/vmd.rst
@@ -0,0 +1,51 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=================================================================
+Linux Base Driver for the Intel(R) Volume Management Device (VMD)
+=================================================================
+
+Intel vmd Linux driver.
+
+Contents
+========
+
+- Overview
+- Features
+- Limitations
+
+The Intel VMD provides the means to provide volume management across separate
+PCI Express HBAs and SSDs without requiring operating system support or
+communication between drivers. It does this by obscuring each storage
+controller from the OS, but allowing a single driver to be loaded that would
+control each storage controller. A Volume Management Device (VMD) provides a
+single device for a single storage driver. The VMD resides in the IIO root
I'm not sure IIO (and PCH below) are really relevant to this.
I'm trying to describe where in the CPU architecture VMD exists because
it's not like other devices. It's not like a storage device or
networking device that is plugged in somewhere; it exists as part of the
CPU (in the IIO). I'm OK removing it, but it might be confusing to
someone looking at the documentation. I'm also close to this, so it may
be clear to me but confusing to others (which I know it is); any help
making it clearer would be appreciated.
I think we really just care about the PCI topology enumerable by the OS. If
they are relevant, expand them on first use as you did for VMD so we
have a hint about how to learn more about it.
I don't fully understand this comment. The PCI topology behind VMD is
not enumerable by the OS unless we are considering the vmd driver the
OS. If the user enables VMD in the BIOS and the vmd driver isn't loaded,
then the OS never sees the devices behind VMD.
The only reason the devices are seen by the OS is that the VMD driver
does some mapping when it loads during boot.
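To be concrete about what "some mapping" means: roughly speaking, the vmd
driver takes the VMD endpoint's BARs (which back the config space, MMIO
windows, and MSI-X vectors of the hidden devices), creates a brand new root
bus for them, and scans it itself. Something like this happens in
vmd_enable_domain() (simplified sketch from memory of
drivers/pci/controller/vmd.c, so treat the details as approximate):

  /*
   * The new domain is reachable only through the VMD endpoint's BARs,
   * so the PCI core never enumerates it; this driver has to do it.
   */
  vmd->bus = pci_create_root_bus(NULL, vmd->busn_start, &vmd_ops,
                                 sd, &resources);
  if (!vmd->bus)
          return -ENODEV;

  pci_scan_child_bus(vmd->bus);     /* enumerate what the BIOS hid */
  pci_bus_add_devices(vmd->bus);    /* bind drivers (e.g. nvme) to them */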
+complex and it appears to the OS as a root bus integrated endpoint. In the IIO,
I suspect "root bus integrated endpoint" means the same as "Root
Complex Integrated Endpoint" as defined by the PCIe spec? If so,
please use that term and capitalize it so there's no confusion.
OK, will fix.
+the VMD is in a central location to manipulate access to storage devices which
+may be attached directly to the IIO or indirectly through the PCH. Instead of
+allowing individual storage devices to be detected by the OS and allowing it to
+load a separate driver instance for each, the VMD provides configuration
+settings to allow specific devices and root ports on the root bus to be
+invisible to the OS.
How are these settings configured? BIOS setup menu?
I believe there are 2 ways this is done:
The first is that the system designer creates a design such that some
root ports and endpoints are behind VMD. If VMD is enabled in the BIOS
then these devices don't show up to the OS and require a driver to use
them (the vmd driver). If VMD is disabled in the BIOS then the devices
are seen by the OS at boot time.
The second way is that there are settings in the BIOS for VMD. I don't
think there are many settings... it's mostly enable/disable VMD.
+VMD works by creating separate PCI domains for each VMD device in the system.
+This makes VMD look more like a host bridge than an endpoint so VMD must try
+to adhere to the ACPI Operating System Capabilities (_OSC) flags of the system.
As Keith pointed out, I think this needs more details about how the
hardware itself works. I don't think there's enough information here
to maintain the OS/platform interface on an ongoing basis.
I think "creating a separate PCI domain" is a consequence of providing
a new config access mechanism, e.g., a new ECAM region, for devices
below the VMD bridge. That hardware mechanism is important to
understand because it means those downstream devices are unknown to
anything that doesn't grok the config access mechanism. For example,
firmware wouldn't know anything about them unless it had a VMD driver.
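To illustrate what I mean by "a new config access mechanism": the shape of it
is roughly the following (hand-wavy sketch, not the actual vmd code).

  /*
   * Config space for the new domain lives behind one of the VMD
   * endpoint's BARs, laid out ECAM-style, so only software that has
   * mapped that BAR (i.e. a VMD driver) can reach the devices below.
   */
  void __iomem *addr = vmd->cfgbar +
                       PCIE_ECAM_OFFSET(bus->number - vmd->busn_start,
                                        devfn, reg);
  *value = readl(addr);   /* firmware never sees these accesses */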
Some of the pieces that might help figure this out:
I'll add some details to answer these in the documentation, but I'll
give a brief answer here as well
- Which devices (VMD bridge, VMD Root Ports, devices below VMD Root
Ports) are enumerated in the host?
Only the VMD device (as a PCI endpoint) is seen by the OS without the
vmd driver.
- Which devices are passed through to a virtual guest and enumerated
there?
All devices under VMD are passed to a virtual guest
- Where does the vmd driver run (host or guest or both)?
I believe the answer is both.
- Who (host or guest) runs the _OSC for the new VMD domain?
I believe the answer here is neither :) This has been an issue since
commit 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features"). I've
submitted this patch
(https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@xxxxxxxxx/)
to attempt to fix the issue.
You are much more of an expert in this area than I am, but as far as I
can tell the only way the _OSC bits get cleared is via ACPI
(specifically this code
https://elixir.bootlin.com/linux/latest/source/drivers/acpi/pci_root.c#L1038).
Since ACPI doesn't run on the devices behind VMD the _OSC bits don't get
set properly for them.
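For reference, what that commit does is roughly the following when the VMD
root bus is brought up: it copies the _OSC results that ACPI negotiated for
the root bridge VMD sits under onto the synthetic VMD host bridge
(paraphrased from memory, so the exact field list may be slightly off):

  static void vmd_copy_host_bridge_flags(struct pci_host_bridge *root_bridge,
                                         struct pci_host_bridge *vmd_bridge)
  {
          /* ACPI never runs _OSC for the VMD domain, so inherit the parent's results */
          vmd_bridge->native_pcie_hotplug = root_bridge->native_pcie_hotplug;
          vmd_bridge->native_shpc_hotplug = root_bridge->native_shpc_hotplug;
          vmd_bridge->native_aer = root_bridge->native_aer;
          vmd_bridge->native_pme = root_bridge->native_pme;
          vmd_bridge->native_ltr = root_bridge->native_ltr;
          vmd_bridge->native_dpc = root_bridge->native_dpc;
  }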
Ultimately the only _OSC bits that VMD cares about are the hotplug bits
because that is a feature of our device; it enables hotplug in guests
where there is no way to enable it. That's why my patch is to set them
all the time and copy the other _OSC bits because there is no other way
to enable this feature (i.e. there is no user space tool to
enable/disable it).
- What happens to interrupts generated by devices downstream from
VMD, e.g., AER interrupts from VMD Root Ports, hotplug interrupts
from VMD Root Ports or switch downstream ports? Who fields them?
In general firmware would field them unless it grants ownership
via _OSC. If firmware grants ownership (or the OS forcibly takes
it by overriding it for hotplug), I guess the OS that requested
ownership would field them?
The interrupts are passed through VMD to the OS. This was the AER issue
that resulted in commit 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe
features"). IIRC AER was disabled in the BIOS, but is was enabled in the
VMD host bridge because pci_init_host_bridge() sets all the bits to 1
and that generated an AER interrupt storm.
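(For context: pci_init_host_bridge() starts every host bridge out assuming
native/OS control, roughly like below, and normally the ACPI _OSC negotiation
then takes some of that away; the VMD host bridge never got that second step.)

  /* defaults in drivers/pci/probe.c, approximately */
  bridge->native_pcie_hotplug = 1;
  bridge->native_shpc_hotplug = 1;
  bridge->native_aer = 1;
  bridge->native_pme = 1;
  bridge->native_ltr = 1;
  bridge->native_dpc = 1;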
In bare metal scenarios the _OSC bits are correct, but in a hypervisor
scenario the bits are wrong because they are all 0 regardless of what
the ACPI tables indicate. The challenge is that the VMD driver has no
way to know it's in a hypervisor to set the hotplug bits correctly.
- How do interrupts (hotplug, AER, etc) for things below VMD work?
Assuming the OS owns the feature, how does the OS discover them?
I feel like this is the same question as above? Or maybe I'm missing a
subtlety about this...
I guess probably the usual PCIe Capability and MSI/MSI-X
Capabilities? Which OS (host or guest) fields them?
+A couple of the _OSC flags regard hotplug support. Hotplug is a feature that
+is always enabled when using VMD regardless of the _OSC flags.
We log the _OSC negotiation in dmesg, so if we ignore or override _OSC
for hotplug, maybe that should be made explicit in the logging
somehow?
That's a really good idea and something I can add to
https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@xxxxxxxxx/
Would a message like this help from the VMD driver?
"VMD enabled, hotplug enabled by VMD"
+Features
+========
+
+- Virtualization
+- MSIX interrupts
+- Power Management
+- Hotplug
s/MSIX/MSI-X/ to match spec usage.
I'm not sure what this list is telling us.
Will fix
+Limitations
+===========
+
+When VMD is enabled and used in a hypervisor the _OSC flags provided by the
+hypervisor BIOS may not be correct. The most critical of these flags are the
+hotplug bits. If these bits are incorrect then the storage devices behind the
+VMD will not be able to be hotplugged. The driver always supports hotplug for
+the devices behind it so the hotplug bits reported by the OS are not used.
"_OSC may not be correct" sounds kind of problematic. How does the
OS deal with this? How does the OS know whether to pay attention to
_OSC or ignore it because it tells us garbage?
That's the $64K question, lol. We've been trying to solve that since
commit 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features") :)
If we ignore _OSC hotplug bits because "we know what we want, and we
know we won't conflict with firmware," how do we deal with other _OSC
bits? AER? PME? What about bits that may be added in the future?
Is there some kind of roadmap to help answer these questions?
As I mentioned earlier, VMD only really cares about hotplug because that
is the feature we enable for guests (and hosts).
I believe the solution is to use the root bridge settings for all the other
bits (which is what happens currently). In practice that means the bits will
be correct for all the features (AER et al.) in a bare metal scenario, and in
a guest scenario all the bits other than hotplug (which we will always
enable) will be 0; that's what we see in every hypervisor scenario we've
tested, and it's fine for us because we don't care about any of the other
bits.
That's why I think it's ok for us to set the hotplug bits to 1 when the
VMD driver loads; we aren't harming any other devices, we are enabling a
feature that we know our users want, and we are setting all the other
_OSC bits "correctly" (for some values of correctly :))
I appreciate your feedback and I'll start working on updating the
documentation to make it clearer. I'll wait to send a v2 until I feel
like we've finished our discussion from this one.
Paul
Bjorn