Re: [PATCH v3 06/15] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe port devices

"Bowman, Terry" <terry.bowman@xxxxxxx> · Wed, 27 Nov 2024 14:53:32 -0600

On 11/27/2024 11:05 AM, Jonathan Cameron wrote:
> On Thu, 21 Nov 2024 14:24:17 -0600
> "Bowman, Terry" <terry.bowman@xxxxxxx> wrote:
>
>> On 11/15/2024 3:35 AM, Lukas Wunner wrote:
>>> On Wed, Nov 13, 2024 at 03:54:20PM -0600, Terry Bowman wrote:  
>>>> The AER service driver's aer_get_device_error_info() function doesn't read
>>>> uncorrectable (UCE) fatal error status from PCIe upstream port devices,
>>>> including CXL upstream switch ports. As a result, fatal errors are not
>>>> logged or handled as needed for CXL PCIe upstream switch port devices.
>>>>
>>>> Update the aer_get_device_error_info() function to read the UCE fatal
>>>> status for all CXL PCIe port devices. Make the change to not affect
>>>> non-CXL PCIe devices.
>>>>
>>>> The fatal error status will be used in future patches implementing
>>>> CXL PCIe port uncorrectable error handling and logging.  
>>> [...]  
>>>> --- a/drivers/pci/pcie/aer.c
>>>> +++ b/drivers/pci/pcie/aer.c
>>>> @@ -1250,7 +1250,8 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
>>>>  	} else if (type == PCI_EXP_TYPE_ROOT_PORT ||
>>>>  		   type == PCI_EXP_TYPE_RC_EC ||
>>>>  		   type == PCI_EXP_TYPE_DOWNSTREAM ||
>>>> -		   info->severity == AER_NONFATAL) {
>>>> +		   info->severity == AER_NONFATAL ||
>>>> +		   (pcie_is_cxl(dev) && type == PCI_EXP_TYPE_UPSTREAM)) {
>>>>  
>>>>  		/* Link is still healthy for IO reads */
>>>>  		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS,  
>>> Just a heads-up, there's another patch pending by Shuai Xue (+cc)
>>> which touches the same code lines.  It re-enables error reporting
>>> for PCIe Upstream Ports (as well as Endpoints) under certain
>>> conditions:
>>>
>>> https://lore.kernel.org/all/20241112135419.59491-3-xueshuai@xxxxxxxxxxxxxxxxx/
>>>
>>> That was originally disabled by Keith Busch (+cc) with commit
>>> 9d938ea53b26 ("PCI/AER: Don't read upstream ports below fatal errors").
>>>
>>> There's some merge conflict potential here if your series goes into
>>> the cxl tree and Shuai's patch into the pci tree in the next cycle.
>>>
>>> Thanks,
>>>
>>> Lukas  
>> Thanks Lukas I took a look at the patchset and reached out to Shuai (you're CC'd). Sorry, I thought
>> I responded here earlier.
> I'm guessing we might not need this change if we can base querying on the
> link being good.  If the error is on the CXL protocol side, the link should
> still be fine I think?
>
> Jonathan

Hi Jonathan,

Shuai is determining upstream link viability using a call to pciehp_check_link_active() in dpc.c. But, link viability is not determined dynamically for call to aer_get_device_error_info() in his patchset. I suppose we could add this for CXL devices and continue to isolate the new logic from PCIe devices. Your thoughts?
Link to the brief discussion with Shuai is here: https://lore.kernel.org/linux-pci/11282df5-9126-4b5b-82ae-5f1ef3b8aaf5@xxxxxxxxxxxxxxxxx/ Regards, Terry
>> Regards,
>> Terry