On Tue, Sep 19, 2023 at 4:34 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Tue, Sep 19, 2023 at 11:31:57AM +0800, Kai-Heng Feng wrote:
> > On Wed, Sep 13, 2023 at 8:50 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > [snipped]
> > > Hmm. In some ways the VMD device acts as a Root Port, since it
> > > originates a new hierarchy in a separate domain, but on the upstream
> > > side, it's just a normal endpoint.
> > >
> > > How does AER for the new hierarchy work? A device below the VMD can
> > > generate ERR_COR/ERR_NONFATAL/ERR_FATAL messages. I guess I was
> > > assuming those messages would terminate at the VMD, and the VMD could
> > > generate an AER interrupt just like a Root Port. But that can't be
> > > right because I don't think VMD would have the Root Error Command
> > > register needed to manage that interrupt.
> >
> > VMD itself doesn't seem to manage AER, the rootport that "moved" from
> > 0000 domain does:
> > [ 2113.507345] pcieport 10000:e0:06.0: AER: Corrected error received:
> > 10000:e1:00.0
> > [ 2113.507380] nvme 10000:e1:00.0: PCIe Bus Error: severity=Corrected,
> > type=Physical Layer, (Receiver ID)
> > [ 2113.507389] nvme 10000:e1:00.0: device [144d:a80a] error
> > status/mask=00000001/0000e000
> > [ 2113.507398] nvme 10000:e1:00.0: [ 0] RxErr (First)
>
> Oh, I forgot how VMD works. It sounds like there *is* a Root Port
> that is logically below the VMD, e.g., (from
> https://bugzilla.kernel.org/show_bug.cgi?id=215027):
>
>   ACPI: PCI Root Bridge [PC00] (domain 0000 [bus 00-e0])
>   acpi PNP0A08:00: _OSC: platform does not support [AER]
>   acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug SHPCHotplug PME PCIeCapability LTR]
>   pci 0000:00:0e.0: [8086:467f] type 00 # VMD
>   vmd 0000:00:0e.0: PCI host bridge to bus 10000:e0
>   pci 10000:e0:06.0: [8086:464d] type 01 # Root Port to [bus e1]
>   pci 10000:e1:00.0: [144d:a80a] type 00 # Samsung NVMe
>
> So ERR_* messages from the e1:00.0 Samsung device would terminate at
> the e0:06.0 Root Port. That Root Port has an AER Capability with Root
> Error Command/Status/Error Source registers.
>
> > > But if VMD just passes those messages up to the Root Port, the source
> > > of the messages (the Requester ID) won't make any sense because
> > > they're in a hierarchy the Root Port doesn't know anything about.
> >
> > Not sure what's current status is but I think Nirmal's patch is valid
> > for both our cases.
>
> So I think the question is whether that PNP0A08:00 _OSC applies to
> domain 10000. I think the answer is "no" because the platform doesn't
> know about the existence of domain 10000, and it can't access config
> space in that domain.

Well, the VMD device itself is there in domain 0000, however, and sure
enough, the platform firmware can access its config space.

> E.g., if _OSC negotiated that the platform owned AER in domain 0000, I
> don't think it would make sense for that to mean the platform *also*
> owned AER in domain 10000, because the platform doesn't know how to
> configure AER or handle AER interrupts in that domain.

I'm not sure about this. AFAICS, domain 10000 is not physically
independent of domain 0000, so I'm not sure to what extent the above
applies.

> Nirmal's patch ignores _OSC for hotplug, but keeps the _OSC results
> for AER, PME, and LTR. I think we should ignore _OSC for *all* of
> them.
>
> That would mean reverting 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on
> PCIe features") completely, so of course we'd have to figure out how
> to resolve the AER message flood a different way.

I agree with the above approach, however.
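
To spell out what the three options discussed above would mean for the
VMD host bridge, here is a toy model. It is Python rather than kernel C
and it is only a sketch: the field names mirror the native_* bits of
struct pci_host_bridge, but the BridgeFlags class, the vmd_flags()
helper and the policy names are made up for illustration. It also
assumes that a host bridge which ignores _OSC simply keeps the default
of "OS owns everything", and it takes the domain 0000 values from the
_OSC lines quoted above (the platform keeps AER, the OS gets the rest).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BridgeFlags:
    """Hypothetical stand-in for the native_* bits of a PCI host bridge."""
    native_pcie_hotplug: bool = True
    native_shpc_hotplug: bool = True
    native_aer: bool = True
    native_pme: bool = True
    native_ltr: bool = True


HOTPLUG = ("native_pcie_hotplug", "native_shpc_hotplug")
OTHERS = ("native_aer", "native_pme", "native_ltr")


def vmd_flags(root: BridgeFlags, policy: str) -> BridgeFlags:
    """Return the flags the VMD bridge would end up with under a policy.

    honor_all      - copy every _OSC result from domain 0000
                     (what 04b12ef163d1 is described as doing)
    ignore_hotplug - copy AER, PME and LTR only (Nirmal's patch as
                     summarized in this thread)
    ignore_all     - copy nothing (the full revert)
    """
    if policy == "honor_all":
        inherited = HOTPLUG + OTHERS
    elif policy == "ignore_hotplug":
        inherited = OTHERS
    elif policy == "ignore_all":
        inherited = ()
    else:
        raise ValueError(f"unknown policy: {policy}")
    # Start from the assumed defaults and overwrite only inherited bits.
    copied = {name: getattr(root, name) for name in inherited}
    return replace(BridgeFlags(), **copied)


# Domain 0000 per the quoted log: platform does not grant AER to the OS.
root = BridgeFlags(native_aer=False)

for policy in ("honor_all", "ignore_hotplug", "ignore_all"):
    print(policy, vmd_flags(root, policy))
```

In this model only ignore_all leaves native_aer set for domain 10000,
which is the case where the AER message flood would need to be dealt
with in some other way.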