On 11/27/2024 11:05 AM, Jonathan Cameron wrote: > On Thu, 21 Nov 2024 14:24:17 -0600 > "Bowman, Terry" <terry.bowman@xxxxxxx> wrote: > >> On 11/15/2024 3:35 AM, Lukas Wunner wrote: >>> On Wed, Nov 13, 2024 at 03:54:20PM -0600, Terry Bowman wrote: >>>> The AER service driver's aer_get_device_error_info() function doesn't read >>>> uncorrectable (UCE) fatal error status from PCIe upstream port devices, >>>> including CXL upstream switch ports. As a result, fatal errors are not >>>> logged or handled as needed for CXL PCIe upstream switch port devices. >>>> >>>> Update the aer_get_device_error_info() function to read the UCE fatal >>>> status for all CXL PCIe port devices. Make the change to not affect >>>> non-CXL PCIe devices. >>>> >>>> The fatal error status will be used in future patches implementing >>>> CXL PCIe port uncorrectable error handling and logging. >>> [...] >>>> --- a/drivers/pci/pcie/aer.c >>>> +++ b/drivers/pci/pcie/aer.c >>>> @@ -1250,7 +1250,8 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info) >>>> } else if (type == PCI_EXP_TYPE_ROOT_PORT || >>>> type == PCI_EXP_TYPE_RC_EC || >>>> type == PCI_EXP_TYPE_DOWNSTREAM || >>>> - info->severity == AER_NONFATAL) { >>>> + info->severity == AER_NONFATAL || >>>> + (pcie_is_cxl(dev) && type == PCI_EXP_TYPE_UPSTREAM)) { >>>> >>>> /* Link is still healthy for IO reads */ >>>> pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, >>> Just a heads-up, there's another patch pending by Shuai Xue (+cc) >>> which touches the same code lines. It re-enables error reporting >>> for PCIe Upstream Ports (as well as Endpoints) under certain >>> conditions: >>> >>> https://lore.kernel.org/all/20241112135419.59491-3-xueshuai@xxxxxxxxxxxxxxxxx/ >>> >>> That was originally disabled by Keith Busch (+cc) with commit >>> 9d938ea53b26 ("PCI/AER: Don't read upstream ports below fatal errors"). >>> >>> There's some merge conflict potential here if your series goes into >>> the cxl tree and Shuai's patch into the pci tree in the next cycle. >>> >>> Thanks, >>> >>> Lukas >> Thanks Lukas I took a look at the patchset and reached out to Shuai (you're CC'd). Sorry, I thought >> I responded here earlier. > I'm guessing we might not need this change if we can base querying on the > link being good. If the error is on the CXL protocol side, the link should > still be fine I think? > > Jonathan Hi Jonathan, Shuai is determining upstream link viability using a call to pciehp_check_link_active() in dpc.c. But, link viability is not determined dynamically for call to aer_get_device_error_info() in his patchset. I suppose we could add this for CXL devices and continue to isolate the new logic from PCIe devices. Your thoughts? Link to the brief discussion with Shuai is here: https://lore.kernel.org/linux-pci/11282df5-9126-4b5b-82ae-5f1ef3b8aaf5@xxxxxxxxxxxxxxxxx/ Regards, Terry >> Regards, >> Terry