On 2/12/2025 4:59 PM, Dan Williams wrote: > Terry Bowman wrote: >> The CXL RAS handlers do not currently log if the RAS registers are >> unmapped. This is needed in order to help debug CXL error handling. Update >> the CXL driver to log a warning message if the RAS register block is >> unmapped. >> >> Also, add type check before processing EP or RCH DP. >> >> Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx> >> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> >> Reviewed-by: Ira Weiny <ira.weiny@xxxxxxxxx> >> Reviewed-by: Gregory Price <gourry@xxxxxxxxxx> >> --- >> drivers/cxl/core/pci.c | 20 ++++++++++++++------ >> 1 file changed, 14 insertions(+), 6 deletions(-) >> >> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c >> index 69bb030aa8e1..af809e7cbe3b 100644 >> --- a/drivers/cxl/core/pci.c >> +++ b/drivers/cxl/core/pci.c >> @@ -658,15 +658,19 @@ static void __cxl_handle_cor_ras(struct device *dev, >> void __iomem *addr; >> u32 status; >> >> - if (!ras_base) >> + if (!ras_base) { >> + dev_warn_once(dev, "CXL RAS register block is not mapped"); >> return; >> + } >> >> addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET; >> status = readl(addr); >> - if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) { >> - writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr); >> + if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK)) >> + return; >> + writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr); >> + >> + if (is_cxl_memdev(dev)) >> trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status); > I think trace_cxl_aer_correctable_error() should always fire and this > should somehow be unified with the CPER record trace-event for protocol > errors. > > The only usage of @memdev in this trace is retrieving the device serial > number. If the device is not a memdev then print zero for the serial > number, or something like that. > > In the end RAS daemon should only need to enable one trace event to get > protocol errors and header logs from ports or endpoints, either > natively, or via CPER. > That would be: we use 'struct *device' instead of 'struct *cxl_memdev' and pass serial# in as a parameter (0 in non-EP cases)? >> - } >> } >> >> static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds) >> @@ -702,8 +706,10 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base) >> u32 status; >> u32 fe; >> >> - if (!ras_base) >> + if (!ras_base) { >> + dev_warn_once(dev, "CXL RAS register block is not mapped"); > Is this a "never can happen" print? It seems like an oversight in an > upper layer to get this far down error reporting without the registers > mapped. > > Like maybe this is a bug in a driver that should crash, or the driver > should not be registering broken error handlers? Correct. The error handler assignment and enablement is gated by RAS mapping in cxl_uport_init_ras_reporting() and cxl_dport_init_ras_reporting(). Terry