Re: [PATCH v6] cxl: add RAS status unmasking for CXL

On 2/10/23 5:11 PM, Bjorn Helgaas wrote:
On Fri, Feb 10, 2023 at 04:46:15PM -0700, Dave Jiang wrote:
On 2/10/23 3:52 PM, Bjorn Helgaas wrote:
On Fri, Feb 10, 2023 at 10:04:03AM -0700, Dave Jiang wrote:
By default the CXL RAS mask register bits are set to 1 and suppress all
error reporting. If the kernel has negotiated ownership of error handling
for CXL, then unmask the mask registers by writing 0s.

PCI_EXP_AER_FLAGS is moved to the linux/pci.h header to expose it to
drivers. This exposes the system-enabled PCI error flags so the driver can
decide which error bits to toggle. Bjorn suggested that error enabling
should be controlled by system policy rather than a driver-level choice[1].

CXL RAS CE and UE masks are checked against PCI_EXP_AER_FLAGS before
unmasking.

[1]: https://lore.kernel.org/linux-cxl/20230210122952.00006999@xxxxxxxxxx/T/#me8c7f39d43029c64ccff5c950b78a2cee8e885af

+static int cxl_pci_ras_unmask(struct pci_dev *pdev)
+{
+	struct pci_host_bridge *host_bridge = pci_find_host_bridge(pdev->bus);
+	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
+	void __iomem *addr;
+	u32 orig_val, val, mask;
+
+	if (!cxlds->regs.ras)
+		return -ENODEV;
+
+	/* BIOS has CXL error control */
+	if (!host_bridge->native_cxl_error)
+		return -EOPNOTSUPP;
+
+	if (PCI_EXP_AER_FLAGS & PCI_EXP_DEVCTL_URRE) {

1) I don't really want to expose PCI_EXP_AER_FLAGS in linux/pci.h.
It's basically a convenience part of the AER implementation.

2) I think your intent here is to configure the CXL RAS masking based
on what PCIe error reporting is enabled, but doing it by looking at
PCI_EXP_AER_FLAGS doesn't seem right.  This expression is a
compile-time constant that is always true, but we can't rely on
devices always being configured that way.

We call pci_aer_init() for every device during enumeration, but we
only write PCI_EXP_AER_FLAGS if pci_aer_available() and if
pcie_aer_is_native().  And there are a bunch of drivers that call
pci_disable_pcie_error_reporting(), which *clears* those flags.  I'm
not sure those drivers *should* be doing that, but they do today.
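
To illustrate the point above, here is a minimal user-space sketch. It is
not kernel code: struct fake_dev and the fake_* helpers are hypothetical
stand-ins that only model the behaviour described in this mail. The
PCI_EXP_DEVCTL_* values are the Device Control bit definitions; the
PCI_EXP_AER_FLAGS definition is reproduced here as an assumption about how
the AER code combines them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Device Control error reporting enable bits. */
#define PCI_EXP_DEVCTL_CERE  0x0001 /* Correctable Error Reporting Enable */
#define PCI_EXP_DEVCTL_NFERE 0x0002 /* Non-Fatal Error Reporting Enable */
#define PCI_EXP_DEVCTL_FERE  0x0004 /* Fatal Error Reporting Enable */
#define PCI_EXP_DEVCTL_URRE  0x0008 /* Unsupported Request Reporting Enable */

/* Assumed to match the convenience macro discussed in the thread. */
#define PCI_EXP_AER_FLAGS (PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
			   PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)

/* The check from the patch is a constant expression: always true. */
static_assert((PCI_EXP_AER_FLAGS & PCI_EXP_DEVCTL_URRE) != 0,
	      "says nothing about how a given device is configured");

/* Hypothetical stand-in for a device; not a kernel structure. */
struct fake_dev {
	uint16_t devctl;	/* simulated PCI_EXP_DEVCTL contents */
};

/* Models enumeration: flags are written only if AER is available and native. */
static void fake_aer_init(struct fake_dev *dev, bool aer_available,
			  bool aer_native)
{
	if (aer_available && aer_native)
		dev->devctl |= PCI_EXP_AER_FLAGS;
}

/* Models a driver that clears the reporting enables again. */
static void fake_disable_error_reporting(struct fake_dev *dev)
{
	dev->devctl &= (uint16_t)~PCI_EXP_AER_FLAGS;
}

/* Runtime check against what the device is actually configured to do. */
static bool fake_urre_enabled(const struct fake_dev *dev)
{
	return (dev->devctl & PCI_EXP_DEVCTL_URRE) != 0;
}
```

In this model, fake_urre_enabled() returns false for a device where AER is
not native, or after fake_disable_error_reporting() has run, while the
constant expression stays true in every case.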

I'm not sure why this needs to be conditional at all, but if it does,
maybe you want to read PCI_EXP_DEVCTL and base it on that?

OK, I'll read PCI_EXP_DEVCTL. I'm looking to unmask only the relevant RAS
reporting if the respective PCIe bits are enabled.
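
A rough user-space sketch of that approach, under stated assumptions:
struct fake_ras_regs, the FAKE_RAS_*_MASK_BITS widths, and fake_ras_unmask()
are hypothetical and only model the idea; they are not the driver's register
layout. Gating the UE mask on URRE follows the check in the patch above;
gating the CE mask on CERE is an assumption. In the driver, the devctl value
would presumably come from pcie_capability_read_word() on PCI_EXP_DEVCTL.

```c
#include <stdint.h>

#define PCI_EXP_DEVCTL_CERE 0x0001 /* Correctable Error Reporting Enable */
#define PCI_EXP_DEVCTL_URRE 0x0008 /* Unsupported Request Reporting Enable */

/* Hypothetical mask bit ranges, chosen for illustration only. */
#define FAKE_RAS_UE_MASK_BITS 0x0001ffffu
#define FAKE_RAS_CE_MASK_BITS 0x0000007fu

/* Hypothetical stand-in for the two RAS mask registers. */
struct fake_ras_regs {
	uint32_t ue_mask;	/* 1 = uncorrectable error masked */
	uint32_t ce_mask;	/* 1 = correctable error masked */
};

/*
 * Clear (unmask) the RAS mask bits only for the error classes whose
 * reporting enable bit is set in the devctl value read from the device.
 */
static void fake_ras_unmask(struct fake_ras_regs *ras, uint16_t devctl)
{
	if (devctl & PCI_EXP_DEVCTL_URRE)
		ras->ue_mask &= ~FAKE_RAS_UE_MASK_BITS;

	if (devctl & PCI_EXP_DEVCTL_CERE)
		ras->ce_mask &= ~FAKE_RAS_CE_MASK_BITS;
}
```

With devctl == 0, both masks keep their default all-ones state, so nothing
is reported; each enable bit unmasks only its own error class.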

That sounds OK to me, but leaves the question of those drivers that
call pci_disable_pcie_error_reporting() because CXL won't know about
that.  But maybe that's not a problem, I dunno.

Currently the CXL subsystem covers the type-3 devices, so I don't think
it'll be an issue. Type-2 may be an issue, but it doesn't go through the
current driver. Maybe we'll figure out how to deal with that when those
device drivers show up.



I see you're just adding a check of return value here, but I'm not
sure you need to call pci_enable_pcie_error_reporting() in the first
place.  Isn't the call in the pci_aer_init() path enough?

I guess I'm confused by the kernel documentation:
"
pci_enable_pcie_error_reporting enables the device to send error
messages to root port when an error is detected. Note that devices
don't enable the error reporting by default, so device drivers need
call this function to enable it.
"

That seems to indicate that a driver should always call this if it wants
AER reporting?

Oh, thanks for pointing that out!  I'll update that doc to match the
current code, which *does* enable reporting by default:

Ah, OK. I'll remove the call to pci_enable_pcie_error_reporting().


commit f26e58bf6f54 ("PCI/AER: Enable error reporting when AER is native")
Author: Stefan Roese <sr@xxxxxxx>
Date:   Tue Jan 25 08:18:20 2022 +0100

     PCI/AER: Enable error reporting when AER is native

     If we have native control of AER, set the following error reporting enable
     bits:

       - Correctable Error Reporting Enable
       - Non-Fatal Error Reporting Enable
       - Fatal Error Reporting Enable
       - Unsupported Request Reporting Enable

     Note that these bits are all in the Device Control register and are not
     AER-specific.

     This affects all devices with an AER capability, including hot-added
     devices.

     Please note that this change is quite invasive, as error reporting now will
     be enabled for all available PCIe Endpoints, which was previously not the
     case.

     When "pci=noaer" is selected, error reporting stays disabled of course.
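
For reference, the four enable bits listed in that commit log can be decoded
from a Device Control value with a small helper like the one below. This is
a user-space debugging sketch only; devctl_describe() and the
devctl_err_bits table are hypothetical names, not kernel interfaces.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PCI_EXP_DEVCTL_CERE  0x0001 /* Correctable Error Reporting Enable */
#define PCI_EXP_DEVCTL_NFERE 0x0002 /* Non-Fatal Error Reporting Enable */
#define PCI_EXP_DEVCTL_FERE  0x0004 /* Fatal Error Reporting Enable */
#define PCI_EXP_DEVCTL_URRE  0x0008 /* Unsupported Request Reporting Enable */

static const struct {
	uint16_t bit;
	const char *name;
} devctl_err_bits[] = {
	{ PCI_EXP_DEVCTL_CERE,  "CERE"  },
	{ PCI_EXP_DEVCTL_NFERE, "NFERE" },
	{ PCI_EXP_DEVCTL_FERE,  "FERE"  },
	{ PCI_EXP_DEVCTL_URRE,  "URRE"  },
};

/*
 * Write a comma-separated list of the enabled reporting bits into buf.
 * Returns the number of names written; stops early if buf is too small.
 */
static int devctl_describe(uint16_t devctl, char *buf, size_t len)
{
	size_t used = 0;
	int count = 0;

	if (len)
		buf[0] = '\0';

	for (size_t i = 0;
	     i < sizeof(devctl_err_bits) / sizeof(devctl_err_bits[0]); i++) {
		int n;

		if (!(devctl & devctl_err_bits[i].bit))
			continue;

		n = snprintf(buf + used, len - used, "%s%s",
			     count ? ", " : "", devctl_err_bits[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* output would be truncated */

		used += (size_t)n;
		count++;
	}

	return count;
}
```

A value with all four bits set describes the state the commit establishes
when AER is native; a value of 0 corresponds to reporting left disabled.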


