Re: [v6 11/11 PATCH] cxl/pci: Add callback to log AER correctable error

Bjorn Helgaas <helgaas@xxxxxxxxxx> · Wed, 7 Dec 2022 14:29:20 -0600

Hi Terry,

On Wed, Dec 07, 2022 at 02:04:17PM -0600, Terry Bowman wrote:
> On 11/30/22 18:02, Dave Jiang wrote:
> > Add AER error handler callback to read the RAS capability structure
> > correctable error (CE) status register for the CXL device. Log the
> > error as a trace event and clear the error. For CXL devices, the driver
> > also needs to write back to the status register to clear the
> > unmasked correctable errors.
> > 
> > See CXL spec rev3.0 8.2.4.16 for RAS capability structure CE Status
> > Register.
> > 
> > Suggested-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
> > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
> > Signed-off-by: Dave Jiang <dave.jiang@xxxxxxxxx>
> > ---
> > 
> > v6:
> > - Update commit log to point to RAS capability structure. (Bjorn)
> > - Change cxl_correctable_error_logging() to cxl_cor_error_detected().
> >   (Bjorn)
> > 
> >  drivers/cxl/pci.c |   20 ++++++++++++++++++++
> >  1 file changed, 20 insertions(+)
> > 
> > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> > index 11f842df9807..02342830b612 100644
> > --- a/drivers/cxl/pci.c
> > +++ b/drivers/cxl/pci.c
> > @@ -622,10 +622,30 @@ static void cxl_error_resume(struct pci_dev *pdev)
> >  		 dev->driver ? "successful" : "failed");
> >  }
> >  
> > +static void cxl_cor_error_detected(struct pci_dev *pdev)
> > +{
> > +	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> > +	struct cxl_memdev *cxlmd = cxlds->cxlmd;
> > +	struct device *dev = &cxlmd->dev;
> > +	void __iomem *addr;
> > +	u32 status;
> > +
> > +	if (!cxlds->regs.ras)
> > +		return;
> > +
> > +	addr = cxlds->regs.ras + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
> > +	status = le32_to_cpu(readl(addr));
> > +	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
> > +		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> > +		trace_cxl_aer_correctable_error(dev_name(dev), status);
> > +	}
> > +}
> > +
> 
> This will log PCI AER CEs only if there is also a RAS CE. My
> understanding (could be the problem) is AER CE's are normally
> reported. Will this be inconsistent with other error AER CE
> handling?

I can't quite parse this, so let me ramble and see if we accidentally
converge on some understanding :)

cxl_cor_error_detected() is the .cor_error_detected handler, which is
called by the AER code in the PCI core.  So IIUC, we'll only get here
if that PCI core AER code is invoked via an AER interrupt, AER
polling, or an event from the ACPI APEI framework.

So I would expect "this will only log CXL RAS CEs if there is a PCI
AER CE", which is the opposite of what you said.  But I have no idea
at all about how CXL RAS CEs work.

It looks like aer_enable_rootport() sets PCI_ERR_ROOT_CMD_COR_EN, so I
would expect that AER CEs normally cause interrupts and would be
discovered that way.

> >  static const struct pci_error_handlers cxl_error_handlers = {
> >  	.error_detected	= cxl_error_detected,
> >  	.slot_reset	= cxl_slot_reset,
> >  	.resume		= cxl_error_resume,
> > +	.cor_error_detected	= cxl_cor_error_detected,
> >  };
> >  
> >  static struct pci_driver cxl_pci_driver = {
> > 
> > 
>