Correction. This applies to:
8cf0b93919e1 (tag: v6.12-rc2) Linux 6.12-rc2

On 10/25/2024 4:02 PM, Terry Bowman wrote:
> This is a continuation of the CXL port error handling RFC from earlier.[1]
> The RFC resulted in the decision to add CXL PCIe port error handling to
> the existing RCH downstream port handling in the AER service driver. This
> patchset adds the CXL PCIe port protocol error handling and logging.
>
> The first 7 patches update the existing AER service driver to support CXL
> PCIe port protocol error handling and reporting. This includes AER service
> driver changes for adding correctable and uncorrectable error support, CXL
> specific recovery handling, and addition of CXL driver callback handlers.
>
> The following 7 patches address CXL driver support for CXL PCIe port
> protocol errors. This includes the following changes to the CXL drivers:
> mapping CXL port and downstream port RAS registers, interface updates for
> common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
> adding port specific error handlers, and protocol error logging.
>
> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@xxxxxxx/
>
> Testing:
>
> Below are test results for this patchset using Qemu with CXL root
> port(0c:00.0), CXL upstream switchport(0d:00.0), CXL downstream
> switchport(0e:00.0). CXL endpoint(0f:00.0) CE and UCE logs are
> also added to show the existing PCIe endpoint handling is not changed.
>
> This was tested using aer-inject updated to support CE and UCE internal
> error injection. CXL RAS was set using a test patch (not upstreamed but can
> provide if needed).
>
> - Root port UCE:
> root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00400000/02000000
> pcieport 0000:0c:00.0: [22] UncorrIntErr
> aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
> cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
> Kernel panic - not syncing: CXL cachemem error. Invoking panic
> CPU: 1 UID: 0 PID: 146 Comm: irq/24-aerdrv Tainted: G E 6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
> Tainted: [E]=UNSIGNED_MODULE
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> Call Trace:
> <TASK>
> dump_stack_lvl+0x27/0x90
> dump_stack+0x10/0x20
> panic+0x33e/0x380
> cxl_do_recovery+0x116/0x120
> ? srso_return_thunk+0x5/0x5f
> aer_isr+0x3e0/0x710
> irq_thread_fn+0x28/0x70
> irq_thread+0x179/0x240
> ? srso_return_thunk+0x5/0x5f
> ? __pfx_irq_thread_fn+0x10/0x10
> ? __pfx_irq_thread_dtor+0x10/0x10
> ? __pfx_irq_thread+0x10/0x10
> kthread+0xf5/0x130
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x3c/0x60
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1a/0x30
> </TASK>
> Kernel Offset: 0x29000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
> - Root port CE:
> root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh
> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
> pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
> pcieport 0000:0c:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00004000/0000a000
> pcieport 0000:0c:00.0: [14] CorrIntErr
> aer_event: 0000:0c:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
> cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'
>
> - Upstream switch port UCE:
> root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh
> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
> pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
> pcieport 0000:0d:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00400000/02000000
> pcieport 0000:0d:00.0: [22] UncorrIntErr
> aer_event: 0000:0d:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
> cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
> Kernel panic - not syncing: CXL cachemem error. Invoking panic
> CPU: 1 UID: 0 PID: 148 Comm: irq/24-aerdrv Tainted: G E 6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
> Tainted: [E]=UNSIGNED_MODULE
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> Call Trace:
> <TASK>
> dump_stack_lvl+0x27/0x90
> dump_stack+0x10/0x20
> panic+0x33e/0x380
> cxl_do_recovery+0x116/0x120
> ? srso_return_thunk+0x5/0x5f
> aer_isr+0x3e0/0x710
> ? free_cpumask_var+0x9/0x10
> ? kfree+0x259/0x2e0
> irq_thread_fn+0x28/0x70
> irq_thread+0x179/0x240
> ? srso_return_thunk+0x5/0x5f
> ? __pfx_irq_thread_fn+0x10/0x10
> ? __pfx_irq_thread_dtor+0x10/0x10
> ? __pfx_irq_thread+0x10/0x10
> kthread+0xf5/0x130
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x3c/0x60
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1a/0x30
> </TASK>
> Kernel Offset: 0x24c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
> - Upstream switch port CE:
> root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh
> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
> pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
> pcieport 0000:0d:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00004000/0000a000
> pcieport 0000:0d:00.0: [14] CorrIntErr
> aer_event: 0000:0d:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
> cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'
>
> - Downstream switch port UCE:
> root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
> pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
> pcieport 0000:0e:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00400000/02000000
> pcieport 0000:0e:00.0: [22] UncorrIntErr
> aer_event: 0000:0e:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
> cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
> Kernel panic - not syncing: CXL cachemem error. Invoking panic
> CPU: 1 UID: 0 PID: 147 Comm: irq/24-aerdrv Tainted: G E 6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
> Tainted: [E]=UNSIGNED_MODULE
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> Call Trace:
> <TASK>
> dump_stack_lvl+0x27/0x90
> dump_stack+0x10/0x20
> panic+0x33e/0x380
> cxl_do_recovery+0x116/0x120
> ? srso_return_thunk+0x5/0x5f
> aer_isr+0x3e0/0x710
> irq_thread_fn+0x28/0x70
> irq_thread+0x179/0x240
> ? srso_return_thunk+0x5/0x5f
> ? __pfx_irq_thread_fn+0x10/0x10
> ? __pfx_irq_thread_dtor+0x10/0x10
> ? __pfx_irq_thread+0x10/0x10
> kthread+0xf5/0x130
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x3c/0x60
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1a/0x30
> </TASK>
> Kernel Offset: 0x19c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
> - Downstream switch port CE:
> root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
> pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
> pcieport 0000:0e:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00004000/0000a000
> pcieport 0000:0e:00.0: [14] CorrIntErr
> aer_event: 0000:0e:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
> cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
>
> - Endpoint CE
> root@tbowman-cxl:~/aer-inject# ./ep-ce-inject.sh
> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000040/00000000 into device 0000:0f:00.0
> pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0f:00.0
> cxl_pci 0000:0f:00.0: CXL Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
> cxl_pci 0000:0f:00.0: device [8086:0d93] error status/mask=00000040/0000e000
> cxl_pci 0000:0f:00.0: [ 6] BadTLP
> aer_event: 0000:0f:00.0 CXL Bus Error: severity=Corrected, Bad TLP, TLP Header=Not available
> cxl_aer_correctable_error: memdev=mem1 host=0000:0f:00.0 serial=0: status: 'Received Error From Physical Layer'
>
> - Endpoint UCE
> root@tbowman-cxl:~/aer-inject# ./ep-uce-inject.sh
> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00040000 into device 0000:0f:00.0
> pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0f:00.0
> cxl_pci 0000:0f:00.0: AER: CXL Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
> aer_event: 0000:0f:00.0 CXL Bus Error: severity=Fatal, , TLP Header=Not available
> cxl_aer_uncorrectable_error: memdev=mem1 host=0000:0f:00.0 serial=0: status: 'Memory Byte Enable Parity Error' firs'
> cxl_pci 0000:0f:00.0: mem1: frozen state error detected, disable CXL.mem
> cxl_detach_ep: cxl_mem mem1: disconnect mem1 from port2
> cxl_detach_ep: cxl_mem mem1: disconnect mem1 from port1
> pcieport 0000:0e:00.0: unlocked secondary bus reset via: pciehp_reset_slot+0xac/0x160
> pcieport 0000:0e:00.0: AER: Downstream Port link has been reset (0)
> cxl_pci 0000:0f:00.0: mem1: restart CXL.mem after slot reset
> devm_cxl_enumerate_ports: cxl_mem mem1: scan: iter: mem1 dport_dev: 0000:0e:00.0 parent: 0000:0d:00.0
> devm_cxl_enumerate_ports: cxl_mem mem1: found already registered port port2:0000:0d:00.0
> devm_cxl_enumerate_ports: cxl_mem mem1: scan: iter: 0000:0e:00.0 dport_dev: 0000:0c:00.0 parent: pci0000:0c
> devm_cxl_enumerate_ports: cxl_mem mem1: found already registered port port1:pci0000:0c
> __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4500
> cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
> __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4500
> cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
> __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
> cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
> __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
> cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
> <snip>
> cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
> __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
> cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
> __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
> cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
> __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
> cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
> cxl_bus_probe: cxl_nvdimm pmem1: probe: 0
> devm_cxl_add_nvdimm: cxl_mem mem1: register pmem1
> pcieport 0000:0e:00.0: RAS is already mapped
> cxl_port port2: RAS is already mapped
> pcieport 0000:0c:00.0: RAS is already mapped
> cxl_port_alloc: cxl_mem mem1: host-bridge: pci0000:0c
> cxl_cdat_get_length: cxl_port endpoint4: CDAT length 160
> cxl_port_perf_data_calculate: cxl_port endpoint4: Failed to retrieve ep perf coordinates.
> cxl_endpoint_parse_cdat: cxl_port endpoint4: Failed to do perf coord calculations.
> init_hdm_decoder: cxl_port endpoint4: decoder4.0: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
> add_hdm_decoder: cxl decoder4.0: Added to port endpoint4
> init_hdm_decoder: cxl_port endpoint4: decoder4.1: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
> add_hdm_decoder: cxl decoder4.1: Added to port endpoint4
> init_hdm_decoder: cxl_port endpoint4: decoder4.2: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
> add_hdm_decoder: cxl decoder4.2: Added to port endpoint4
> init_hdm_decoder: cxl_port endpoint4: decoder4.3: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
> add_hdm_decoder: cxl decoder4.3: Added to port endpoint4
> cxl_bus_probe: cxl_port endpoint4: probe: 0
> devm_cxl_add_port: cxl_mem mem1: endpoint4 added to port2
> cxl_bus_probe: cxl_mem mem1: probe: 0
> cxl_pci 0000:0f:00.0: mem1: error resume successful
> pcieport 0000:0e:00.0: AER: device recovery successful
>
> Changes in v1 -> v2:
> [Jonathan] Remove extra NULL check and cleanup in cxl_pci_port_ras()
> [Jonathan] Update DSP map patch description
> [Jonathan] Update cxl_pci_port_ras() to check for NULL port
> [Jonathan] Don't call handler before handler port changes are present (patch order).
> [Bjorn] Fix linebreak in cover sheet URL
> [Bjorn] Remove timestamps from test logs in cover sheet
> [Bjorn] Retitle AER commits to use "PCI/AER:"
> [Bjorn] Retitle patch#3 to use renaming instead of refactoring
> [Bjorn] Fix base commit-id on cover sheet
> [Bjorn] Add VH spec reference/citation
> [Terry] Removed last 2 patches to enable internal errors. They are not needed
>   because internal errors are enabled in the AER driver.
> [Dan] Create cxl_do_recovery() and pci_driver::cxl_err_handlers.
> [Dan] Use kernel panic in CXL recovery
> [Dan] cxl_port_hndlrs -> cxl_port_error_handlers
> [Dan] Move cxl_port_error_handlers to pci_driver. Remove module (un)registration.
> [Terry] Add patch w/ cxl_assign_port_error_handlers() and cxl_clear_port_error_handlers()
> [Terry] Removed PCI_ERS_RESULT_PANIC patch. It is no longer needed because the result type parameter
>   is not used in the cxl_err_handlers callbacks.
>
> Changes in RFC -> v1:
> [Dan] Rename cxl_rch_handle_error() to cxl_handle_error()
> [Dan] Add cxl_do_recovery()
> [Jonathan] Flatten cxl_setup_parent_uport()
> [Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
> [Jonathan] Rename cxl_dev_is_pci_type()
> [Ming] bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport) can
>   replace these find_cxl_port() and device_find_child().
> [Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
> [Ming] Don't use endpoint as host to cxl_map_component_regs()
> [Bjorn] Use "PCIe UIR/CIE" instead of "AER UI/CIE"
> [Bjorn] Don't use Kconfig to enable/disable a CXL external interface
>
> Terry Bowman (14):
>   PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct
>     pci_driver'
>   PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe port
>     support
>   cxl/pci: Introduce helper functions pcie_is_cxl() and
>     pcie_is_cxl_port()
>   PCI/AER: Modify AER driver logging to report CXL or PCIe bus error
>     type
>   PCI/AER: Add CXL PCIe port correctable error support in AER service
>     driver
>   PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe
>     port devices
>   PCI/AER: Add CXL PCIe port uncorrectable error recovery in AER service
>     driver
>   cxl/pci: Change find_cxl_ports() to non-static
>   cxl/pci: Map CXL PCIe root port and downstream switch port RAS
>     registers
>   cxl/pci: Map CXL PCIe upstream switch port RAS registers
>   cxl/pci: Rename RAS handler interfaces to also indicate CXL PCIe port
>     support
>   cxl/pci: Add error handler for CXL PCIe port RAS errors
>   cxl/pci: Add trace logging for CXL PCIe port RAS errors
>   cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers
>
>  drivers/cxl/core/core.h       |   3 +
>  drivers/cxl/core/pci.c        | 180 +++++++++++++++++++++++++++-------
>  drivers/cxl/core/port.c       |   4 +-
>  drivers/cxl/core/trace.h      |  47 +++++++++
>  drivers/cxl/cxl.h             |  10 +-
>  drivers/cxl/mem.c             |  29 +++++-
>  drivers/pci/pci.c             |  14 +++
>  drivers/pci/pci.h             |   3 +
>  drivers/pci/pcie/aer.c        |  99 ++++++++++++-------
>  drivers/pci/pcie/err.c        |  54 ++++++++++
>  drivers/pci/probe.c           |  10 ++
>  include/linux/pci.h           |  13 +++
>  include/ras/ras_event.h       |   9 +-
>  include/uapi/linux/pci_regs.h |   3 +-
>  14 files changed, 396 insertions(+), 82 deletions(-)
>
>
> base-commit: 739a5da7ed744578a9477fb322f04afecafca6b0