Re: [PATCH v2 1/3] cxl/pci: Fix appropriate checking for _OSC while handling CXL RAS registers

Smita Koralahalli <Smita.KoralahalliChannabasappa@xxxxxxx> · Tue, 8 Aug 2023 12:37:57 -0700

On 8/7/2023 8:17 PM, Dan Williams wrote:
Smita Koralahalli wrote:
According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner
of AER should also own CXL Protocol Error Management as there is no
explicit control of CXL Protocol error. And the CXL RAS Cap registers
reported on Protocol errors should check for AER _OSC rather than CXL
Memory Error Reporting Control _OSC.

The CXL Memory Error Reporting Control _OSC specifically highlights
handling Memory Error Logging and Signaling Enhancements. These kinds of
errors are reported through a device's mailbox and can be managed
independently from CXL Protocol Errors.

This change fixes handling and reporting CXL Protocol Errors and RAS
registers natively with native AER and FW-First CXL Memory Error Reporting
Control.

I feel like this could be said more succinctly and with an indication of
what the end user should expect to see. Something like:

"cxl_pci fails to unmask CXL protocol errors when CXL memory error
reporting is not granted native control. Given that CXL memory error
reporting uses the event interface and protocol errors use AER, unmask
protocol errors based only on the native AER setting. Without this
change end user deployments will fail to report protocol errors in the
case where native memory error handling is not granted to Linux."

Sure, I will reword this for a clearer description. Thanks!
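
For reference, the behavior change described in the suggested commit text can be sketched outside the kernel as below. This is only a minimal model, not the driver code: struct mock_host_bridge, both gate functions, and the return value 1 (meaning "would continue on to unmask protocol errors") are hypothetical names and conventions made up for illustration. Only the two flag names and the -ENXIO / 0 return values come from the diff.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two _OSC-derived flags on the host bridge. */
struct mock_host_bridge {
	bool native_aer;       /* OS was granted PCIe AER control */
	bool native_cxl_error; /* OS was granted CXL Memory Error Reporting control */
};

/* Gate before the patch: keyed on CXL memory error reporting control. */
static int ras_unmask_gate_old(const struct mock_host_bridge *hb)
{
	assert(hb != NULL);
	if (!hb->native_cxl_error)
		return -ENXIO;
	return 1; /* illustrative only: would go on to unmask protocol errors */
}

/* Gate as posted in this v2 patch: keyed on AER control instead. */
static int ras_unmask_gate_new(const struct mock_host_bridge *hb)
{
	assert(hb != NULL);
	if (!hb->native_aer)
		return 0;
	return 1; /* illustrative only: would go on to unmask protocol errors */
}
```

With native_aer set and native_cxl_error clear (native AER, FW-First memory error reporting), the old gate stops with -ENXIO while the new gate lets the unmask path continue, which is the case the commit message is about.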


[1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022.

Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL")
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@xxxxxxx>
---
v2:
	Added fixes tag.
	Included what the patch fixes in commit message.
---
  drivers/cxl/pci.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 1cb1494c28fe..2323169b6e5f 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -541,9 +541,9 @@ static int cxl_pci_ras_unmask(struct pci_dev *pdev)
  		return 0;
  	}
  
-	/* BIOS has CXL error control */
-	if (!host_bridge->native_cxl_error)
-		return -ENXIO;
+	/* BIOS has PCIe AER error control */
+	if (!host_bridge->native_aer)
+		return 0;

The error code does not matter here, and changing it makes the patch
noisier than it needs to be. So just leave it as:

Keeping -ENXIO will return an error from the cxl_pci probe, thereby
failing device node creation in FW-First AER/DPC. I cannot think of
other places where we reference the device node in FW-First mode, but
there is one place where this could become a roadblock.
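
To make the concern concrete, here is a minimal sketch of a probe path that treats a negative unmask result as fatal. The propagation itself is the assumption stated above and has not been checked against drivers/cxl/pci.c. struct mock_probe_result and mock_probe() are hypothetical names, not kernel symbols.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical result of a probe attempt; not a kernel type. */
struct mock_probe_result {
	int rc;              /* 0 on success, negative errno on failure */
	bool memdev_created; /* whether the memory device node was registered */
};

/*
 * Sketch of a probe that bails out when RAS unmasking reports an error.
 * Assumption: the unmask return value is propagated by the probe.
 */
static struct mock_probe_result mock_probe(int ras_unmask_rc)
{
	struct mock_probe_result res = { .rc = 0, .memdev_created = false };

	if (ras_unmask_rc < 0) {
		res.rc = ras_unmask_rc; /* fail before the node is created */
		return res;
	}

	res.memdev_created = true; /* node creation is reached only on success */
	return res;
}
```

Under that assumption, mock_probe(-ENXIO) leaves memdev_created false, while mock_probe(0) reaches node creation. That is why the posted patch returns 0 in the FW-First AER case.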

I'm trying to add trace events support for FW-First Protocol Errors. 
https://lore.kernel.org/linux-cxl/D9381C12-A585-4089-873B-3707C17823D3@xxxxxx/T/#mcaf8a78c1295372ab811be7e1ccb6a8a4d99f3e9

We already have trace_cxl_aer_correctable_error(), and a similar
function for uncorrectable errors, for native protocol error reporting.
I was trying to reuse the same functions for FW-First as well. They
reference the CXL memory device node, which will be NULL in FW-First if
this returns an error.

I don't mind having a separate trace event function for FW-First mode,
as it would simplify things, especially when dealing with RCH DP. But
there may be other places where we reference this device node in
FW-First. Please advise.

Thanks,
Smita


	return -ENXIO;

  
  	rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
  	if (rc)
--
2.17.1