Re: [PATCH v3 7/7] acpi/ghes, cxl/pci: Process CXL CPER Protocol Errors

"Koralahalli Channabasappa, Smita" <Smita.KoralahalliChannabasappa@xxxxxxx> · Fri, 6 Dec 2024 08:29:05 -0800

On 12/2/2024 10:48 AM, Ira Weiny wrote:
Smita Koralahalli wrote:
On 11/26/2024 8:05 AM, Jonathan Cameron wrote:
On Tue, 19 Nov 2024 00:39:15 +0000
Smita Koralahalli <Smita.KoralahalliChannabasappa@xxxxxxx> wrote:

When PCIe AER is in FW-First, OS should process CXL Protocol errors from
CPER records. Introduce support for handling and logging CXL Protocol
errors.

The defined trace events cxl_aer_uncorrectable_error and
cxl_aer_correctable_error trace native CXL AER endpoint errors, while
cxl_cper_trace_corr_prot_err and cxl_cper_trace_uncorr_prot_err
trace native CXL AER port errors. Reuse both sets to trace FW-First
protocol errors.

Since the CXL code is required to be called from process context and
GHES is in interrupt context, use workqueues for processing.

Similar to CXL CPER event handling, use kfifo to handle errors as it
simplifies queue processing by providing lock free fifo operations.

Add the ability for the CXL sub-system to register a workqueue to
process CXL CPER protocol errors.

Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@xxxxxxx>

A few minor comments inline.

Thanks

Jonathan

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 4ede038a7148..c992b34c290b 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -650,6 +650,56 @@ void read_cdat_data(struct cxl_port *port)
   }
   EXPORT_SYMBOL_NS_GPL(read_cdat_data, CXL);
   
+void cxl_cper_trace_corr_prot_err(struct pci_dev *pdev, bool flag,
+				  struct cxl_ras_capability_regs ras_cap)
+{
+	struct cxl_dev_state *cxlds;
+	u32 status;
+
+	status = ras_cap.cor_status & ~ras_cap.cor_mask;
+
+	if (!flag) {

As below. Name of flag is not very helpful when reading the code.
Perhaps we can rename?

Okay. May be flag -> is_device_error ?

I had the same question about 'flag'.


+		trace_cxl_port_aer_correctable_error(&pdev->dev, status);
+		return;
+	}
+
+	cxlds = pci_get_drvdata(pdev);
+	if (!cxlds)
+		return;
+
+	trace_cxl_aer_correctable_error(cxlds->cxlmd, status);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_cper_trace_corr_prot_err, CXL);
+
+void cxl_cper_trace_uncorr_prot_err(struct pci_dev *pdev, bool flag,
+				    struct cxl_ras_capability_regs ras_cap)
+{
+	struct cxl_dev_state *cxlds;
+	u32 status, fe;
+
+	status = ras_cap.uncor_status & ~ras_cap.uncor_mask;
+
+	if (hweight32(status) > 1)
+		fe = BIT(FIELD_GET(CXL_RAS_CAP_CONTROL_FE_MASK,
+				   ras_cap.cap_control));
+	else
+		fe = status;
+
+	if (!flag) {

Why does  a bool named flag indicate it's a port error?

I will rename it.

Or may be use an enum to explicitly define the error type
(CXL_ERROR_TYPE_DEVICE and CXL_ERROR_TYPE_PORT).

Or may be split the function into two distinct ones, one for port errors
and one for device errors.

I would vote for 2 functions.
Ira

Noted. Thanks!

Thanks
Smita