Ira Weiny wrote: > Smita Koralahalli wrote: > > When PCIe AER is in FW-First, OS should process CXL Protocol errors from > > CPER records. Introduce support for handling and logging CXL Protocol > > errors. > > > > The defined trace events cxl_aer_uncorrectable_error and > > cxl_aer_correctable_error trace native CXL AER endpoint errors. Reuse them > > to trace FW-First Protocol errors. > > > > Since the CXL code is required to be called from process context and > > GHES is in interrupt context, use workqueues for processing. > > > > Similar to CXL CPER event handling, use kfifo to handle errors as it > > simplifies queue processing by providing lock free fifo operations. > > > > Add the ability for the CXL sub-system to register a workqueue to > > process CXL CPER protocol errors. > > > > Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@xxxxxxx> > > --- > > Comments: There is a potential failure case, and I am seeking feedback. > > > > If a CXL Protocol Error occurs during boot: Both acpi_ghes_init() and > > cxl_core_init() are subsys_initcall. GHES might detect the error and > > trigger cxl_cper_post_prot_err() even before CXL device is completely > > enumerated. (i.e pdev might return NULL OR pdev might succeed and cxlds > > might be NULL as cxl_pci driver is not loaded.) > > > > I don't think this is something we should be overly concerned about. > If protocol errors are occurring that early then they are very likely to > be bad hardware which is going to happen once the subsystems are brought > up and start working with the devices. So new errors will alert the user. > > > Usage of delayed_workqueue(): Would delaying the handling/logging of > > errors, particularly uncorrectable errors, be acceptable? > > Any alternative suggestions for addressing this issue would be greatly > > appreciated. > > > > Tony questioned choosing value 8 for FIFO_DEPTH in v6. That was just a > > random value that I picked. I would appreciate any suggestions in > > considering the appropriate value for number of entries. > > FWIW I think this is fine until someone sees a reason to increase it. > > [snip] > > > + > > +static void cxl_cper_trace_corr_prot_err(struct pci_dev *pdev, > > + struct cxl_ras_capability_regs ras_cap) > > +{ > > + u32 status = ras_cap.cor_status & ~ras_cap.cor_mask; > > + struct cxl_dev_state *cxlds; > > + > > + cxlds = pci_get_drvdata(pdev); > > + if (!cxlds) > > + return; > > + > > + trace_cxl_aer_correctable_error(cxlds->cxlmd, status); > > I dug into this just a bit wondering if passing cxl_memdev is the best way > for this tracepoint to work given the type 2 work... > > + Alejandro > > For now I think this patch is fine. So. > > Reviewed-by: Ira Weiny <ira.weiny@xxxxxxxxx> Unfortunately I missed the fact that this breaks the test build. ras.o needs to be in the test build files as well. Ira diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild index ef10a896a384..4efcc0606bd6 100644 --- a/tools/testing/cxl/Kbuild +++ b/tools/testing/cxl/Kbuild @@ -62,6 +62,7 @@ cxl_core-y += $(CXL_CORE_SRC)/hdm.o cxl_core-y += $(CXL_CORE_SRC)/pmu.o cxl_core-y += $(CXL_CORE_SRC)/cdat.o cxl_core-y += $(CXL_CORE_SRC)/acpi.o +cxl_core-y += $(CXL_CORE_SRC)/ras.o cxl_core-$(CONFIG_TRACING) += $(CXL_CORE_SRC)/trace.o cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o cxl_core-$(CONFIG_CXL_FEATURES) += $(CXL_CORE_SRC)/features.o