Re: [PATCH v7 10/17] cxl/pci: Add log message and add type check in existing RAS handlers

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, 12 Feb 2025 18:08:13 -0600
"Bowman, Terry" <terry.bowman@xxxxxxx> wrote:

> On 2/12/2025 4:59 PM, Dan Williams wrote:
> > Terry Bowman wrote:  
> >> The CXL RAS handlers do not currently log if the RAS registers are
> >> unmapped. This is needed in order to help debug CXL error handling. Update
> >> the CXL driver to log a warning message if the RAS register block is
> >> unmapped.
> >>
> >> Also, add type check before processing EP or RCH DP.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx>
> >> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
> >> Reviewed-by: Ira Weiny <ira.weiny@xxxxxxxxx>
> >> Reviewed-by: Gregory Price <gourry@xxxxxxxxxx>
> >> ---
> >>  drivers/cxl/core/pci.c | 20 ++++++++++++++------
> >>  1 file changed, 14 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> >> index 69bb030aa8e1..af809e7cbe3b 100644
> >> --- a/drivers/cxl/core/pci.c
> >> +++ b/drivers/cxl/core/pci.c
> >> @@ -658,15 +658,19 @@ static void __cxl_handle_cor_ras(struct device *dev,
> >>  	void __iomem *addr;
> >>  	u32 status;
> >>  
> >> -	if (!ras_base)
> >> +	if (!ras_base) {
> >> +		dev_warn_once(dev, "CXL RAS register block is not mapped");
> >>  		return;
> >> +	}
> >>  
> >>  	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
> >>  	status = readl(addr);
> >> -	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
> >> -		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> >> +	if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
> >> +		return;
> >> +	writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> >> +
> >> +	if (is_cxl_memdev(dev))
> >>  		trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);  
> > I think trace_cxl_aer_correctable_error() should always fire and this
> > should somehow be unified with the CPER record trace-event for protocol
> > errors.
> >
> > The only usage of @memdev in this trace is retrieving the device serial
> > number. If the device is not a memdev then print zero for the serial
> > number, or something like that.
> >
> > In the end RAS daemon should only need to enable one trace event to get
> > protocol errors and header logs from ports or endpoints, either
> > natively, or via CPER.
> >  
> That would be: we use 'struct *device' instead of 'struct *cxl_memdev'
> and pass serial# in as a parameter (0 in non-EP cases)?

For a USP may well have a serial number cap. Might be worth getting it?

slightly nasty thing here will be change of memdev=%s to devname=%s or
similar.  Meh. It's a TP_printk() I'm not sure anyone will care.
Need to be careful with the tracepoint itself though and the rasdaemon
etc handling.
https://github.com/mchehab/rasdaemon/blob/master/ras-cxl-handler.c#L351
We may have to just add another field.

> 
> >> -	}
> >>  }
> >>  
> >>  static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
> >> @@ -702,8 +706,10 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
> >>  	u32 status;
> >>  	u32 fe;
> >>  
> >> -	if (!ras_base)
> >> +	if (!ras_base) {
> >> +		dev_warn_once(dev, "CXL RAS register block is not mapped");  
> > Is this a "never can happen" print? It seems like an oversight in an
> > upper layer to get this far down error reporting without the registers
> > mapped.
> >
> > Like maybe this is a bug in a driver that should crash, or the driver
> > should not be registering broken error handlers?  
> Correct. The error handler assignment and enablement is gated by RAS mapping
> in cxl_uport_init_ras_reporting() and cxl_dport_init_ras_reporting().
> 
> Terry
>        
> 
> 
> 
> 






[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux