Re: [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging

Correction: this series applies to the following base commit:

8cf0b93919e1 (tag: v6.12-rc2) Linux 6.12-rc2


On 10/25/2024 4:02 PM, Terry Bowman wrote:
> This is a continuation of the CXL port error handling RFC from earlier.[1]
> The RFC resulted in the decision to add CXL PCIe port error handling to
> the existing RCH downstream port handling in the AER service driver. This
> patchset adds the CXL PCIe port protocol error handling and logging.
>
> The first 7 patches update the existing AER service driver to support CXL
> PCIe port protocol error handling and reporting. This includes AER service
> driver changes for adding correctable and uncorrectable error support, CXL
> specific recovery handling, and addition of CXL driver callback handlers.
>
> The following 7 patches address CXL driver support for CXL PCIe port
> protocol errors. This includes the following changes to the CXL drivers:
> mapping CXL port and downstream port RAS registers, interface updates for
> common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
> adding port specific error handlers, and protocol error logging.
>
> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@xxxxxxx/
>
> Testing:
>
> Below are test results for this patchset using QEMU with a CXL root
> port (0c:00.0), a CXL upstream switch port (0d:00.0), and a CXL downstream
> switch port (0e:00.0). CE and UCE logs for a CXL endpoint (0f:00.0) are
> also included to show that the existing PCIe endpoint handling is unchanged.
>
> This was tested using aer-inject updated to support CE and UCE internal
> error injection. CXL RAS was set using a test patch (not upstreamed, but I
> can provide it if needed).
>
>  - Root port UCE:
>  root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
>  pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
>  pcieport 0000:0c:00.0:    [22] UncorrIntErr
>  aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>  cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
>  Kernel panic - not syncing: CXL cachemem error. Invoking panic
>  CPU: 1 UID: 0 PID: 146 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
>  Tainted: [E]=UNSIGNED_MODULE
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>  Call Trace:
>   <TASK>
>   dump_stack_lvl+0x27/0x90
>   dump_stack+0x10/0x20
>   panic+0x33e/0x380
>   cxl_do_recovery+0x116/0x120
>   ? srso_return_thunk+0x5/0x5f
>   aer_isr+0x3e0/0x710
>   irq_thread_fn+0x28/0x70
>   irq_thread+0x179/0x240
>   ? srso_return_thunk+0x5/0x5f
>   ? __pfx_irq_thread_fn+0x10/0x10
>   ? __pfx_irq_thread_dtor+0x10/0x10
>   ? __pfx_irq_thread+0x10/0x10
>   kthread+0xf5/0x130
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork+0x3c/0x60
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
>  Kernel Offset: 0x29000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>  ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
>  - Root port CE:
>  root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
>  pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
>  pcieport 0000:0c:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00004000/0000a000
>  pcieport 0000:0c:00.0:    [14] CorrIntErr
>  aer_event: 0000:0c:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>  cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'
>
>  - Upstream switch port UCE:
>  root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
>  pcieport 0000:0d:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00400000/02000000
>  pcieport 0000:0d:00.0:    [22] UncorrIntErr
>  aer_event: 0000:0d:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>  cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
>  Kernel panic - not syncing: CXL cachemem error. Invoking panic
>  CPU: 1 UID: 0 PID: 148 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
>  Tainted: [E]=UNSIGNED_MODULE
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>  Call Trace:
>   <TASK>
>   dump_stack_lvl+0x27/0x90
>   dump_stack+0x10/0x20
>   panic+0x33e/0x380
>   cxl_do_recovery+0x116/0x120
>   ? srso_return_thunk+0x5/0x5f
>   aer_isr+0x3e0/0x710
>   ? free_cpumask_var+0x9/0x10
>   ? kfree+0x259/0x2e0
>   irq_thread_fn+0x28/0x70
>   irq_thread+0x179/0x240
>   ? srso_return_thunk+0x5/0x5f
>   ? __pfx_irq_thread_fn+0x10/0x10
>   ? __pfx_irq_thread_dtor+0x10/0x10
>   ? __pfx_irq_thread+0x10/0x10
>   kthread+0xf5/0x130
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork+0x3c/0x60
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
>  Kernel Offset: 0x24c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>  ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
>  - Upstream switch port CE:
>  root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
>  pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
>  pcieport 0000:0d:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00004000/0000a000
>  pcieport 0000:0d:00.0:    [14] CorrIntErr
>  aer_event: 0000:0d:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>  cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'
>
>  - Downstream switch port UCE:
>  root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
>  pcieport 0000:0e:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00400000/02000000
>  pcieport 0000:0e:00.0:    [22] UncorrIntErr
>  aer_event: 0000:0e:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>  cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
>  Kernel panic - not syncing: CXL cachemem error. Invoking panic
>  CPU: 1 UID: 0 PID: 147 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
>  Tainted: [E]=UNSIGNED_MODULE
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>  Call Trace:
>   <TASK>
>   dump_stack_lvl+0x27/0x90
>   dump_stack+0x10/0x20
>   panic+0x33e/0x380
>   cxl_do_recovery+0x116/0x120
>   ? srso_return_thunk+0x5/0x5f
>   aer_isr+0x3e0/0x710
>   irq_thread_fn+0x28/0x70
>   irq_thread+0x179/0x240
>   ? srso_return_thunk+0x5/0x5f
>   ? __pfx_irq_thread_fn+0x10/0x10
>   ? __pfx_irq_thread_dtor+0x10/0x10
>   ? __pfx_irq_thread+0x10/0x10
>   kthread+0xf5/0x130
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork+0x3c/0x60
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
>  Kernel Offset: 0x19c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>  ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
>  - Downstream switch port CE:
>  root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
>  pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
>  pcieport 0000:0e:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00004000/0000a000
>  pcieport 0000:0e:00.0:    [14] CorrIntErr
>  aer_event: 0000:0e:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>  cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
>
>  - Endpoint CE:
>  root@tbowman-cxl:~/aer-inject# ./ep-ce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000040/00000000 into device 0000:0f:00.0
>  pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0f:00.0
>  cxl_pci 0000:0f:00.0: CXL Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
>  cxl_pci 0000:0f:00.0:   device [8086:0d93] error status/mask=00000040/0000e000
>  cxl_pci 0000:0f:00.0:    [ 6] BadTLP
>  aer_event: 0000:0f:00.0 CXL Bus Error: severity=Corrected, Bad TLP, TLP Header=Not available
>  cxl_aer_correctable_error: memdev=mem1 host=0000:0f:00.0 serial=0: status: 'Received Error From Physical Layer'
>
>  - Endpoint UCE:
>  root@tbowman-cxl:~/aer-inject# ./ep-uce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00040000 into device 0000:0f:00.0
>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0f:00.0
>  cxl_pci 0000:0f:00.0: AER: CXL Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
>  aer_event: 0000:0f:00.0 CXL Bus Error: severity=Fatal, , TLP Header=Not available
>  cxl_aer_uncorrectable_error: memdev=mem1 host=0000:0f:00.0 serial=0: status: 'Memory Byte Enable Parity Error' firs'
>  cxl_pci 0000:0f:00.0: mem1: frozen state error detected, disable CXL.mem
>  cxl_detach_ep: cxl_mem mem1: disconnect mem1 from port2
>  cxl_detach_ep: cxl_mem mem1: disconnect mem1 from port1
>  pcieport 0000:0e:00.0: unlocked secondary bus reset via: pciehp_reset_slot+0xac/0x160
>  pcieport 0000:0e:00.0: AER: Downstream Port link has been reset (0)
>  cxl_pci 0000:0f:00.0: mem1: restart CXL.mem after slot reset
>  devm_cxl_enumerate_ports: cxl_mem mem1: scan: iter: mem1 dport_dev: 0000:0e:00.0 parent: 0000:0d:00.0
>  devm_cxl_enumerate_ports: cxl_mem mem1: found already registered port port2:0000:0d:00.0
>  devm_cxl_enumerate_ports: cxl_mem mem1: scan: iter: 0000:0e:00.0 dport_dev: 0000:0c:00.0 parent: pci0000:0c
>  devm_cxl_enumerate_ports: cxl_mem mem1: found already registered port port1:pci0000:0c
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4500
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4500
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  <snip>
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  cxl_bus_probe: cxl_nvdimm pmem1: probe: 0
>  devm_cxl_add_nvdimm: cxl_mem mem1: register pmem1
>  pcieport 0000:0e:00.0: RAS is already mapped
>  cxl_port port2: RAS is already mapped
>  pcieport 0000:0c:00.0: RAS is already mapped
>  cxl_port_alloc: cxl_mem mem1: host-bridge: pci0000:0c
>  cxl_cdat_get_length: cxl_port endpoint4: CDAT length 160
>  cxl_port_perf_data_calculate: cxl_port endpoint4: Failed to retrieve ep perf coordinates.
>  cxl_endpoint_parse_cdat: cxl_port endpoint4: Failed to do perf coord calculations.
>  init_hdm_decoder: cxl_port endpoint4: decoder4.0: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
>  add_hdm_decoder: cxl decoder4.0: Added to port endpoint4
>  init_hdm_decoder: cxl_port endpoint4: decoder4.1: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
>  add_hdm_decoder: cxl decoder4.1: Added to port endpoint4
>  init_hdm_decoder: cxl_port endpoint4: decoder4.2: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
>  add_hdm_decoder: cxl decoder4.2: Added to port endpoint4
>  init_hdm_decoder: cxl_port endpoint4: decoder4.3: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
>  add_hdm_decoder: cxl decoder4.3: Added to port endpoint4
>  cxl_bus_probe: cxl_port endpoint4: probe: 0
>  devm_cxl_add_port: cxl_mem mem1: endpoint4 added to port2
>  cxl_bus_probe: cxl_mem mem1: probe: 0
>  cxl_pci 0000:0f:00.0: mem1: error resume successful
>  pcieport 0000:0e:00.0: AER: device recovery successful
>
> Changes in v1 -> v2:
>  [Jonathan] Remove extra NULL check and cleanup in cxl_pci_port_ras()
>  [Jonathan] Update the DSP map patch description
>  [Jonathan] Update cxl_pci_port_ras() to check for NULL port
>  [Jonathan] Don't call handler before handler port changes are present (patch order).
>  [Bjorn] Fix linebreak in cover sheet URL
>  [Bjorn] Remove timestamps from test logs in cover sheet
>  [Bjorn] Retitle AER commits to use "PCI/AER:"
>  [Bjorn] Retitle patch#3 to use renaming instead of refactoring
>  [Bjorn] Fix base commit-id on cover sheet
>  [Bjorn] Add VH spec reference/citation
>  [Terry] Removed the last 2 patches that enabled internal errors. They are not
>  needed because internal errors are enabled in the AER driver.
>  [Dan] Create cxl_do_recovery() and pci_driver::cxl_err_handlers.
>  [Dan] Use kernel panic in CXL recovery
>  [Dan] cxl_port_hndlrs -> cxl_port_error_handlers
>  [Dan] Move cxl_port_error_handlers to pci_driver. Remove module (un)registration.
>  [Terry] Add patch w/ cxl_assign_port_error_handlers() and cxl_clear_port_error_handlers()
>  [Terry] Removed PCI_ERS_RESULT_PANIC patch. It is no longer needed because the result type parameter
>  is not used in the cxl_err_handlers callbacks.
>
> Changes in RFC -> v1:
>  [Dan] Rename cxl_rch_handle_error() to cxl_handle_error()
>  [Dan] Add cxl_do_recovery()
>  [Jonathan] Flatten cxl_setup_parent_uport()
>  [Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
>  [Jonathan] Rename cxl_dev_is_pci_type()
>  [Ming] bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport) can
>  replace the find_cxl_port() and device_find_child() calls.
>  [Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
>  [Ming] Don't use endpoint as host to cxl_map_component_regs()
>  [Bjorn] Use "PCIe UIR/CIE" instead of "AER UI/CIE"
>  [Bjorn] Don't use Kconfig to enable/disable a CXL external interface
>
> Terry Bowman (14):
>   PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct
>     pci_driver'
>   PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe port
>     support
>   cxl/pci: Introduce helper functions pcie_is_cxl() and
>     pcie_is_cxl_port()
>   PCI/AER: Modify AER driver logging to report CXL or PCIe bus error
>     type
>   PCI/AER: Add CXL PCIe port correctable error support in AER service
>     driver
>   PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe
>     port devices
>   PCI/AER: Add CXL PCIe port uncorrectable error recovery in AER service
>     driver
>   cxl/pci: Change find_cxl_ports() to non-static
>   cxl/pci: Map CXL PCIe root port and downstream switch port RAS
>     registers
>   cxl/pci: Map CXL PCIe upstream switch port RAS registers
>   cxl/pci: Rename RAS handler interfaces to also indicate CXL PCIe port
>     support
>   cxl/pci: Add error handler for CXL PCIe port RAS errors
>   cxl/pci: Add trace logging for CXL PCIe port RAS errors
>   cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers
>
>  drivers/cxl/core/core.h       |   3 +
>  drivers/cxl/core/pci.c        | 180 +++++++++++++++++++++++++++-------
>  drivers/cxl/core/port.c       |   4 +-
>  drivers/cxl/core/trace.h      |  47 +++++++++
>  drivers/cxl/cxl.h             |  10 +-
>  drivers/cxl/mem.c             |  29 +++++-
>  drivers/pci/pci.c             |  14 +++
>  drivers/pci/pci.h             |   3 +
>  drivers/pci/pcie/aer.c        |  99 ++++++++++++-------
>  drivers/pci/pcie/err.c        |  54 ++++++++++
>  drivers/pci/probe.c           |  10 ++
>  include/linux/pci.h           |  13 +++
>  include/ras/ras_event.h       |   9 +-
>  include/uapi/linux/pci_regs.h |   3 +-
>  14 files changed, 396 insertions(+), 82 deletions(-)
>
>
> base-commit: 739a5da7ed744578a9477fb322f04afecafca6b0
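
As a reading aid for the aer_inject lines in the quoted test logs, here is a
minimal user-space sketch that splits an "Injecting errors X/Y" pair into its
two masks and prints the set bits in the same "[22] UncorrIntErr" style the
logs use. It is not part of the patchset. The helper names (bit_name,
parse_inject_masks, decode_status) are made up for illustration, the
correctable/uncorrectable field order is inferred from the logs (the CE
injections set only the first field), and the name tables contain only the
bits that actually appear in the logs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper type: one status bit and the label the logs print. */
struct bit_name {
	unsigned int bit;
	const char *name;
};

/* Only the bits that appear in the quoted logs; not a complete AER table. */
static const struct bit_name cor_names[] = {
	{ 6, "BadTLP" },
	{ 14, "CorrIntErr" },
};

static const struct bit_name uncor_names[] = {
	{ 22, "UncorrIntErr" },
};

#define N_ELEMS(a) (sizeof(a) / sizeof((a)[0]))

static const char *bit_to_name(const struct bit_name *tbl, size_t n,
			       unsigned int bit)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].bit == bit)
			return tbl[i].name;
	return "(not shown in the logs)";
}

/*
 * Parse a "COR/UNCOR" hex pair such as "00000000/00400000".
 * Assumption: first field is the correctable mask, second the uncorrectable.
 * Returns 0 on success, -1 if the string does not contain two hex fields.
 */
static int parse_inject_masks(const char *s, uint32_t *cor, uint32_t *uncor)
{
	unsigned int c, u;

	if (sscanf(s, "%8x/%8x", &c, &u) != 2)
		return -1;
	*cor = c;
	*uncor = u;
	return 0;
}

/*
 * Write "[bit] name" for each set bit of status into out, comma separated.
 * Returns the number of bits written; stops early if out is too small.
 */
static int decode_status(uint32_t status, const struct bit_name *tbl,
			 size_t n, char *out, size_t len)
{
	size_t used = 0;
	int count = 0;

	assert(out && len > 0);
	out[0] = '\0';
	for (unsigned int bit = 0; bit < 32; bit++) {
		int w;

		if (!(status & (UINT32_C(1) << bit)))
			continue;
		w = snprintf(out + used, len - used, "%s[%2u] %s",
			     count ? ", " : "", bit,
			     bit_to_name(tbl, n, bit));
		if (w < 0 || (size_t)w >= len - used)
			break;	/* output truncated, stop here */
		used += (size_t)w;
		count++;
	}
	return count;
}
```

Feeding it the root port UCE pair "00000000/00400000" and decoding the second
field with uncor_names gives "[22] UncorrIntErr". Decoding 00000040 with
cor_names gives "[ 6] BadTLP", which matches the endpoint CE log.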




