On Thu, May 25, 2023 at 11:29:58PM +0200, Robert Richter wrote: > eOn 24.05.23 16:32:35, Bjorn Helgaas wrote: > > On Tue, May 23, 2023 at 06:22:13PM -0500, Terry Bowman wrote: > > > From: Robert Richter <rrichter@xxxxxxx> > > > > > > In Restricted CXL Device (RCD) mode a CXL device is exposed as an > > > RCiEP, but CXL downstream and upstream ports are not enumerated and > > > not visible in the PCIe hierarchy. Protocol and link errors are sent > > > to an RCEC. > > > > > > Restricted CXL host (RCH) downstream port-detected errors are signaled > > > as internal AER errors, either Uncorrectable Internal Error (UIE) or > > > Corrected Internal Errors (CIE). > > > > From the parallelism with RCD above, I first thought that RCH devices > > were non-RCD mode and *were* enumerated as part of the PCIe hierarchy, > > but actually I suspect it's more like the following? > > > > ... but CXL downstream and upstream ports are not enumerated and not > > visible in the PCIe hierarchy. > > > > Protocol and link errors from these non-enumerated ports are > > signaled as internal AER errors ... via a CXL RCEC. > > Exactly, except the RCEC is standard PCIe and also must not > necessarily on the same PCI bus as the CXL RCiEPs are. So make it "RCEC" instead of "CXL RCEC", I guess? PCIe r6.0, sec 7.9.10.3, allows an RCEC to be associated with RCiEPs on different buses, so nothing to see there. > > > The error source is the id of the RCEC. > > > > This seems odd; I assume this refers to the RCEC's AER Error Source > > Identification register, and the ERR_COR or ERR_FATAL/NONFATAL Source > > Identification would ordinarily be the Requester ID of the RCiEP that > > "sent" the Error Message. But you're saying it's actually the ID of > > the *RCEC*, not the RCiEP? > > Right, the downstream port has its own AER ext capability in > non-config (io mapped) RCRB register range. Errors originating from > there are signaled as internal AER errors via the RCEC *with* the > RCEC's Requester ID. Code walks through all associated CXL endpoints, > determines the dport and checks its AER. > > There is also an RDPAS structure defined in CXL but that is only a > different way to provide the RCEC to dport association instead of > using the RCEC's Endpoint Association Extended Capability. In the end > we get all associated RCHs and check the AER of all their dports. > > The upstream port is signaled using the RCiEP's AER. CXL spec is > strict here: "Upstream Port RCRB shall not implement the AER Extended > Capability." The RCiEP's requestor ID is used then and its config > space the AER is in. > > CXL.cachemem errors are reported with the RCiEP as requester > too. Status is in the CXL RAS cap and the UIE or CIE is set > respectively in the AER status of the RCiEP. > > > We're going to call pci_aer_handle_error() as well, to handle the > > non-internal errors, and I'm pretty sure that path expects the RCiEP > > ID there. > > > > Whatever the answer, I'm not sure this sentence is actually relevant > > to this patch, since this patch doesn't read PCI_ERR_ROOT_ERR_SRC or > > look at struct aer_err_source.id. > > The source id is used in aer_process_err_devices() which finally calls > handle_error_source() for the device with the requestor id. This is > the place where cxl_rch_handle_error() checks if it is an RCEC that > received an internal error and has cxl devices connected to it. Then, > the request is forwarded to the cxl_mem handler which also needs to > check the dport now. That is, pcie_walk_rcec() in > cxl_rch_handle_error() is called with the RCEC's pci handle, > cxl_rch_handle_error_iter() with the RCiEP's pci handle. I'm still not sure this is relevant. Isn't that last sentence just the way we always use pcie_walk_rcec()? If there's something *different* here about CXL, and it's important to this patch, sure. But I don't see that yet. Maybe a comment in the code if you think it's important to clarify something there. Bjorn