Re: [PATCH] Documentation: PCI: add vmd documentation

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 4/23/2024 5:47 PM, Bjorn Helgaas wrote:
On Tue, Apr 23, 2024 at 04:10:37PM -0700, Paul M Stillwell Jr wrote:
On 4/23/2024 2:26 PM, Bjorn Helgaas wrote:
On Mon, Apr 22, 2024 at 04:39:19PM -0700, Paul M Stillwell Jr wrote:
On 4/22/2024 3:52 PM, Bjorn Helgaas wrote:
On Mon, Apr 22, 2024 at 02:39:16PM -0700, Paul M Stillwell Jr wrote:
On 4/22/2024 1:27 PM, Bjorn Helgaas wrote:
...

_OSC negotiates ownership of features between platform firmware and
OSPM.  The "native_pcie_hotplug" and similar bits mean that "IF a
device advertises the feature, the OS can use it."  We clear those
native_* bits if the platform retains ownership via _OSC.

If BIOS doesn't enable the VMD host bridge and doesn't supply _OSC for
the domain below it, why would we assume that BIOS retains ownership
of the features negotiated by _OSC?  I think we have to assume the OS
owns them, which is what happened before 04b12ef163d1.

Sorry, this confuses me :) If BIOS doesn't enable VMD (i.e. VMD is disabled)
then all the root ports and devices underneath VMD are visible to the BIOS
and OS so ACPI would run on all of them and the _OSC bits should be set
correctly.

Sorry, that was confusing.  I think there are two pieces to enabling
VMD:

     1) There's the BIOS "VMD enable" switch.  If set, the VMD device
     appears as an RCiEP and the devices behind it are invisible to the
     BIOS.  If cleared, VMD doesn't exist; the VMD RCiEP is hidden and
     the devices behind it appear as normal Root Ports with devices below
     them.

     2) When the BIOS "VMD enable" is set, the OS vmd driver configures
     the VMD RCiEP and enumerates things below the VMD host bridge.

     In this case, BIOS enables the VMD RCiEP, but it doesn't have a
     driver for it and it doesn't know how to enumerate the VMD Root
     Ports, so I don't think it makes sense for BIOS to own features for
     devices it doesn't know about.

That makes sense to me. It sounds like VMD should own all the features, I
just don't know how the vmd driver would set the bits other than hotplug
correctly... We know leaving them on is problematic, but I'm not sure what
method to use to decide which of the other bits should be set or not.

My starting assumption would be that we'd handle the VMD domain the
same as other PCI domains: if a device advertises a feature, the
kernel includes support for it, and the kernel owns it, we enable it.

I've been poking around and it seems like some things (I was looking for
AER) are global to the platform. In my investigation (which is a small
sample size of machines) it looks like there is a single entry in the BIOS
to enable/disable AER so whatever is in one domain should be the same in all
the domains. I couldn't find settings for LTR or the other bits, but I'm not
sure what to look for in the BIOS for those.

So it seems that there are 2 categories: platform global and device
specific. AER and probably some of the others are global and can be copied
from one domain to another, but things like hotplug are device specific and
should be handled that way.

_OSC is the only mechanism for negotiating ownership of these
features, and PCI Firmware r3.3, sec 4.5.1, is pretty clear that _OSC
only applies to the hierarchy originated by the PNP0A03/PNP0A08 host
bridge that contains the _OSC method.  AFAICT, there's no
global/device-specific thing here.

The BIOS may have a single user-visible setting, and it may apply that
setting to all host bridge _OSC methods, but that's just part of the
BIOS UI, not part of the firmware/OS interface.


Fair, but we are still left with the question of how to set the _OSC bits for the VMD bridge. This would normally happen using ACPI AFAICT and we don't have that for the devices behind VMD.

If a device advertises a feature but there's a hardware problem with
it, the usual approach is to add a quirk to work around the problem.
The Correctable Error issue addressed by 04b12ef163d1 ("PCI: vmd:
Honor ACPI _OSC on PCIe features"), looks like it might be in this
category.

I don't think we had a hardware problem with these Samsung (IIRC) devices;
the issue was that the vmd driver were incorrectly enabling AER because
those native_* bits get set automatically.

Where do all the Correctable Errors come from?  IMO they're either
caused by some hardware issue or by a software error in programming
AER.  It's possible we forget to clear the errors and we just see the
same error reported over and over.  But I don't think the answer is
to copy the AER ownership from a different domain.


I looked back at the original bugzilla and I feel like the AER errors are a red herring. AER was *supposed* to be disabled, but was incorrectly enabled by VMD so we are seeing errors. Yes, they may be real errors, but my point is that the user had disabled AER so they didn't care if there were errors or not (i.e. if AER had been correctly disabled by VMD then the user would not have AER errors in the dmesg output).

Kai-Heng even says this in one of his responses here https://lore.kernel.org/linux-pci/CAAd53p6hATV8TOcJ9Qi2rMwVi=y_9+tQu6KhDkAm6Y8=cQ_xoA@xxxxxxxxxxxxxx/. A quote from his reply "To be more precise, AER is disabled by the platform vendor in BIOS to paper over the issue."

Paul




[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux