Re: [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging

Fan Ni <nifan.cxl@xxxxxxxxx> · Mon, 4 Nov 2024 13:48:23 -0800

On Mon, Nov 04, 2024 at 03:25:38PM -0600, Bowman, Terry wrote:
> 
> 
> On 11/1/2024 5:11 PM, Fan Ni wrote:
> > On Fri, Nov 01, 2024 at 01:28:12PM -0500, Bowman, Terry wrote:
> >> Hi Fan,
> >>
> >> I added comments below.
> >>
> >> On 11/1/2024 1:00 PM, Fan Ni wrote:
> >>> On Fri, Oct 25, 2024 at 04:02:51PM -0500, Terry Bowman wrote:
> >>>> This is a continuation of the CXL port error handling RFC from earlier.[1]
> >>>> The RFC resulted in the decision to add CXL PCIe port error handling to
> >>>> the existing RCH downstream port handling in the AER service driver. This
> >>>> patchset adds the CXL PCIe port protocol error handling and logging.
> >>>>
> >>>> The first 7 patches update the existing AER service driver to support CXL
> >>>> PCIe port protocol error handling and reporting. This includes AER service
> >>>> driver changes for adding correctable and uncorrectable error support, CXL
> >>>> specific recovery handling, and addition of CXL driver callback handlers.
> >>>>
> >>>> The following 7 patches address CXL driver support for CXL PCIe port
> >>>> protocol errors. This includes the following changes to the CXL drivers:
> >>>> mapping CXL port and downstream port RAS registers, interface updates for
> >>>> common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
> >>>> adding port specific error handlers, and protocol error logging.
> >>>>
> >>>> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@xxxxxxx/
> >>>>
> >>>> Testing:
> >>> Hi Terry,
> >>> I tried to test the patchset with aer_inject tool (with the patch you shared
> >>> in the last version), and hit some issues.
> >>> Could you help check and give some insights? Thanks.
> >>>
> >>> Below are some test setup info and results.
> >>>
> >>> I tested two topologies:
> >>>   a. one memdev directly attached to a HB with only one RP;
> >>>   b. a topology with cxl switch:
> >>>          HB
> >>>         /  \
> >>>       RP0   RP1
> >>>        |
> >>>      switch
> >>>        |
> >>>  ----------------
> >>>  |    |    |    |
> >>> mem0 mem1 mem2 mem3
> >>>
> >>> For both topologies, I cannot reproduce the system panic shown in your cover
> >>> letter.  
> >>>
> >>> btw, I tried compiling cxl both as modules and built into the kernel.
> >>>
> >>> Below, I will use the direct-attached topology (a) as an example to show what I
> >>> tried. I hope to get some clarity about the test and what I missed or did wrong.
> >>>
> >>> -------------------------------------
> >>> pci device info on the test VM 
> >>> root@fan:~# lspci
> >>> 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
> >>> 00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
> >>> 00:02.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
> >>> 00:03.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
> >>> 00:04.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
> >>> 00:05.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
> >>> 00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
> >>> 00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
> >>> 00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
> >>> 0c:00.0 PCI bridge: Intel Corporation Device 7075
> >>> 0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
> >>> root@fan:~# 
> >>> -------------------------------------
> >>>
> >>> The aer injection input file looks like below,
> >>>
> >>> -------------------------------------
> >>> fan:~/cxl/cxl-test-tool$ cat /tmp/internal 
> >>> AER
> >>> PCI_ID 0000:0c:00.0
> >>> UNCOR_STATUS INTERNAL
> >>> HEADER_LOG 0 1 2 3
> >>> ------------------------------------
> >>>
> >>> dmesg after aer injection 
> >>>
> >>> ssh root@localhost -p 2024 "dmesg"
> >>> [  613.195352] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> >>> [  613.195830] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> >>> [  613.196253] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> >>> [  613.198199] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
> >>> -----------------------------------
> >> This is likely because the device's CXL RAS status is not set and as a result returns false and bypasses the panic.
> >> Unfortunately, the aer-inject only sets the AER status and triggers the interrupt. The CXL RAS is not set.
> >>
> >> I attached 2 'test' patches. The first patch sets the device's RAS status to simulate the error reporting.
> >> This will have to be adjusted as the patch looks for a specific device's bus, and this will likely be a different
> >> bus than that of the device you test in your setup.
> >>
> >> The 2nd patch enables UIE/CIE. I moved this out of the v2 patchset. I need to revisit this to see if it is
> >> needed in the patchset itself (not just a test patch).
> >>
> >> Regards,
> >> Terry
> >>
> > Hi Terry, 
> >
> > I checked the two patches you attached. Do we really need the first
> > patch to unmask internal errors? I see they are already unmasked in
> > aer_enable_internal_errors(), which is called in aer_probe().
> > I tried applying only the other patch and testing again; the test
> > output seems to be the same as with both patches applied. The system panics as well.
> >
> > Fan
> Hi Fan,
> 
> Which device did you inject into? RP, DSP, or USP?
> 
> Yes, the RP UIE & CIE are enabled by the AER driver. RCEC too. But, this is not done for CXL DSP
> and USP. Below are details from the spec describing how an AER error masked at the source will not
> be propagated as notification to the root complex (RP or RCEC).
> 
> 'If an individual error is masked when it is detected, its error status bit is still affected,
> but no error reporting Message is sent to the Root Complex, and the error is not recorded in the
> Header Log, TLP Prefix Log, or First Error Pointer.'[1]
> 
> [1] PCIe Spec 6.2.3.2.2 Masking Individual Errors
> 
> Also, there can be platform BIOS settings that enable/disable UIE/CIE.
> 
> Regards,
> Terry
Oh, I see. I did inject into the RP in my previous setup. Confirmed that we
need the extra unmask for the downstream port case.
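
A minimal sketch of the masking behavior described in the spec text quoted
above, as a plain user-space model rather than kernel code. The struct and
function names are hypothetical and exist only for illustration; the only
value taken from this thread is the 0x00400000 status bit seen in the
aer_inject log line.

```c
#include <stdbool.h>
#include <stdint.h>

/* Uncorrectable Internal Error (UIE) bit, matching the 00400000 value
 * in the aer_inject log above. */
#define UNC_INTERNAL 0x00400000u

/* Hypothetical, simplified model of one port's AER uncorrectable
 * status and mask registers. Not a kernel structure. */
struct aer_port {
	uint32_t uncor_status;
	uint32_t uncor_mask;
};

/*
 * Model of error detection at the source port.
 * The status bit is always recorded, but an error message is sent
 * toward the root complex only when the error is not masked.
 * Returns true if a message would be sent.
 */
static bool detect_uncor_error(struct aer_port *p, uint32_t err)
{
	p->uncor_status |= err;          /* status bit is still affected */
	return !(p->uncor_mask & err);   /* masked: no message is sent */
}
```

With uncor_mask set to UNC_INTERNAL, detect_uncor_error() sets the status
bit but returns false, which models a DSP/USP whose UIE is still masked.
With the mask cleared it returns true, which models the RP case where the
AER driver has already unmasked the error.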

Thanks for the info.

Fan
> 

-- 
Fan Ni