On 11/30/2022 12:45 PM, Bjorn Helgaas wrote:
On Tue, Nov 29, 2022 at 10:49:05AM -0700, Dave Jiang wrote:
Some new devices such as CXL devices may want to record additional error
information on a corrected error. Add a callback to allow the PCI device
driver to do additional logging such as providing additional stats for user
space RAS monitoring.
For CXL device, this is actually a need due to CXL needing to write to the
device AER status register in order to clear the unmasked CEs.
s/CE/correctable error/ since it's the first use and not common in
PCI-land.
Ok
"device AER status register" sounds like the PCIe AER Correctable
Error Status Register (PCIe r6.0, sec 7.8.4.5), but I think you mean
something else, maybe a CXL-specific register?
Yes. It's part of the CXL device RAS structure. I'll add more details.
The PCIe core needs to own the AER one (PCI_ERR_COR_STATUS) so it can
coordinate ownership between firmware and Linux.
Cc: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
Suggested-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
Signed-off-by: Dave Jiang <dave.jiang@xxxxxxxxx>
Acked-by: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
Thank you!
---
Documentation/PCI/pci-error-recovery.rst | 7 +++++++
drivers/pci/pcie/aer.c | 8 +++++++-
include/linux/pci.h | 3 +++
3 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/Documentation/PCI/pci-error-recovery.rst b/Documentation/PCI/pci-error-recovery.rst
index 187f43a03200..690220255d5e 100644
--- a/Documentation/PCI/pci-error-recovery.rst
+++ b/Documentation/PCI/pci-error-recovery.rst
@@ -83,6 +83,7 @@ This structure has the form::
int (*mmio_enabled)(struct pci_dev *dev);
int (*slot_reset)(struct pci_dev *dev);
void (*resume)(struct pci_dev *dev);
+ void (*cor_error_log)(struct pci_dev *dev);
I think I would remove "log" from the name because it suggests this
hook should *only* log, and you need to actually clear some status.
Maybe "cor_error_detected()" to be analogous to error_detected()?
Ok I'll change.
};
The possible channel states are::
@@ -422,5 +423,11 @@ That is, the recovery API only requires that:
- drivers/net/cxgb3
- drivers/net/s2io.c
+ The cor_error_log() callback is invoked in handle_error_source() when
+ the error severity is "correctable". The callback is optional and allows
+ additional logging to be done if desired. See example:
+
+ - drivers/cxl/pci.c
+
The End
-------
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index e2d8a74f83c3..af1b5eecbb11 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -961,8 +961,14 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
if (aer)
pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
info->status);
- if (pcie_aer_is_native(dev))
+ if (pcie_aer_is_native(dev)) {
+ struct pci_driver *pdrv = dev->driver;
+
+ if (pdrv && pdrv->err_handler &&
+ pdrv->err_handler->cor_error_log)
+ pdrv->err_handler->cor_error_log(dev);
pcie_clear_device_status(dev);
+ }
} else if (info->severity == AER_NONFATAL)
pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset);
else if (info->severity == AER_FATAL)
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 575849a100a3..54939b3426a9 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -844,6 +844,9 @@ struct pci_error_handlers {
/* Device driver may resume normal operations */
void (*resume)(struct pci_dev *dev);
+
+ /* Allow device driver to record more details of a correctable error */
+ void (*cor_error_log)(struct pci_dev *dev);
};