Re: [PATCH 0/8] cxl/pci: Add fundamental error handling

On Tue, Mar 15, 2022 at 9:14 PM Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>
> Add a 'struct pci_error_handlers' instance for the cxl_pci driver.
> Section 8.2.5.9 "CXL RAS Capability Structure" of the CXL 2.0
> specification defines the error sources considered in this
> implementation. The RAS Capability Structure defines protocol, link and
> internal errors which are distinct from memory poison errors that are
> conveyed via direct consumption and/or media scanning.
>
> The errors reported by the RAS registers are categorized into
> correctable and uncorrectable errors, where the uncorrectable errors are
> optionally steered to either fatal or non-fatal AER events. Table 224
> "Device Specific Error Reporting and Nomenclature Guidelines" in the CXL
> 2.0 specification outlines that the remediation for uncorrectable errors
> is a reset to recover. This matches how the Linux PCIe AER core treats
> uncorrectable errors as occasions to reset the device to recover
> operation.
>
> While the specification notes "CXL Reset" or "Secondary Bus Reset" as
> theoretical recovery options, they are not feasible in practice since
> in-flight CXL.mem operations may not terminate and may cause knock-on
> system fatal events. Reset is only reliable for recovering CXL.io; it is
> not reliable for recovering CXL.mem. Assuming the system survives, a
> reset causes CXL.mem operation to restart from scratch.
>
> The "ECN: Error Isolation on CXL.mem and CXL.cache" [1] document
> recognizes the CXL Reset vs CXL.mem operational conflict and helps to at
> least provide a mechanism for the Root Port to terminate in-flight
> CXL.mem operations with completions. That still poses problems in
> practice if the kernel is running out of "System RAM" backed by the CXL
> device and poison is used to convey the data lost to the protocol error.
>
> Regardless of whether the reset and restart of CXL.mem operations is
> feasible / successful, the logging is still useful. So, the
> implementation reads, reports, and clears the status in the RAS
> Capability Structure registers, and it notifies the 'struct cxl_memdev'
> associated with the given PCIe endpoint to reattach to its driver over
> the reset so that the HDM decoder configuration can be reconstructed.
>
> The first half of the series reworks component register mapping so that
> the cxl_pci driver can own the RAS Capability while the cxl_port driver
> continues to own the HDM Decoder Capability. The last half implements
> the RAS Capability Structure mapping and reporting via 'struct
> pci_error_handlers'.
>
> [1]: https://www.computeexpresslink.org/spec-landing
>
> ---
>
>
> Dan Williams (8):
>       cxl/pci: Cleanup repeated code in cxl_probe_regs() helpers
>       cxl/pci: Cleanup cxl_map_device_regs()
>       cxl/pci: Kill cxl_map_regs()
>       cxl/core/regs: Make cxl_map_{component,device}_regs() device generic
>       cxl/port: Limit the port driver to just the HDM Decoder Capability
>       cxl/pci: Prepare for mapping RAS Capability Structure
>       cxl/pci: Find and map the RAS Capability Structure
>       cxl/pci: Add (hopeful) error handling support
>
>
>  drivers/cxl/core/hdm.c    |   33 +++++----
>  drivers/cxl/core/memdev.c |    1
>  drivers/cxl/core/pci.c    |    3 -
>  drivers/cxl/core/port.c   |    2 -
>  drivers/cxl/core/regs.c   |  172 ++++++++++++++++++++++++++-------------------
>  drivers/cxl/cxl.h         |   36 +++++++--
>  drivers/cxl/cxlmem.h      |    2 +
>  drivers/cxl/cxlpci.h      |    9 --
>  drivers/cxl/pci.c         |  163 ++++++++++++++++++++++++++++++++-----------
>  9 files changed, 273 insertions(+), 148 deletions(-)
>
> base-commit: 74be98774dfbc5b8b795db726bd772e735d2edd4

Apologies, wrong base-commit. This series is based on that commit plus the following series:

https://lore.kernel.org/linux-cxl/164730733718.3806189.9721916820488234094.stgit@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
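
For readers following the cover letter's "reads, reports, and clears the status" description, here is a minimal userspace sketch of that flow. It is not the code from the series: the register block is a plain array standing in for MMIO, the helper names (mock_readl, mock_writel_rw1c, ras_report_and_clear) are hypothetical, and the register offsets and write-1-to-clear behavior are assumptions taken from a reading of the CXL 2.0 RAS Capability Structure layout.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed offsets within the RAS Capability Structure (CXL 2.0, 8.2.5.9). */
#define RAS_UNCORRECTABLE_STATUS 0x00
#define RAS_CORRECTABLE_STATUS   0x0c
#define RAS_NUM_REGS             6 /* covers offsets 0x00..0x14 */

/* Hypothetical stand-in for the mapped register block (no real MMIO). */
struct mock_ras {
	uint32_t regs[RAS_NUM_REGS];
};

/* Read one 32-bit register at a byte offset. */
static uint32_t mock_readl(const struct mock_ras *ras, unsigned int off)
{
	assert(off % 4 == 0 && off / 4 < RAS_NUM_REGS);
	return ras->regs[off / 4];
}

/* Model a write to a write-1-to-clear status register: set bits are cleared. */
static void mock_writel_rw1c(struct mock_ras *ras, unsigned int off,
			     uint32_t val)
{
	assert(off % 4 == 0 && off / 4 < RAS_NUM_REGS);
	ras->regs[off / 4] &= ~val;
}

/*
 * Read a status register, log it if any bit is set, then write the same
 * bits back to clear them. Returns the status value that was observed.
 */
static uint32_t ras_report_and_clear(struct mock_ras *ras, unsigned int off,
				     const char *name)
{
	uint32_t status = mock_readl(ras, off);

	if (status) {
		printf("%s error status: 0x%08" PRIx32 "\n", name, status);
		mock_writel_rw1c(ras, off, status);
	}
	return status;
}
```

In a real driver the accessors would be readl()/writel() against the ioremap()ed capability, and the report step would run from the 'struct pci_error_handlers' callbacks rather than being called directly.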
