On Mon, Nov 04, 2024 at 03:25:38PM -0600, Bowman, Terry wrote:
>
>
> On 11/1/2024 5:11 PM, Fan Ni wrote:
> > On Fri, Nov 01, 2024 at 01:28:12PM -0500, Bowman, Terry wrote:
> >> Hi Fan,
> >>
> >> I added comments below.
> >>
> >> On 11/1/2024 1:00 PM, Fan Ni wrote:
> >>> On Fri, Oct 25, 2024 at 04:02:51PM -0500, Terry Bowman wrote:
> >>>> This is a continuation of the CXL port error handling RFC from earlier.[1]
> >>>> The RFC resulted in the decision to add CXL PCIe port error handling to
> >>>> the existing RCH downstream port handling in the AER service driver. This
> >>>> patchset adds the CXL PCIe port protocol error handling and logging.
> >>>>
> >>>> The first 7 patches update the existing AER service driver to support CXL
> >>>> PCIe port protocol error handling and reporting. This includes AER service
> >>>> driver changes for adding correctable and uncorrectable error support, CXL
> >>>> specific recovery handling, and addition of CXL driver callback handlers.
> >>>>
> >>>> The following 7 patches address CXL driver support for CXL PCIe port
> >>>> protocol errors. This includes the following changes to the CXL drivers:
> >>>> mapping CXL port and downstream port RAS registers, interface updates for
> >>>> common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
> >>>> adding port specific error handlers, and protocol error logging.
> >>>>
> >>>> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@xxxxxxx/
> >>>>
> >>>> Testing:
> >>> Hi Terry,
> >>> I tried to test the patchset with the aer_inject tool (with the patch you shared
> >>> in the last version), and hit some issues.
> >>> Could you help check and give some insights? Thanks.
> >>>
> >>> Below are some test setup info and results.
> >>>
> >>> I tested two topologies:
> >>> a. one memdev directly attached to a HB with only one RP;
> >>> b. a topology with a cxl switch:
> >>>          HB
> >>>         /  \
> >>>      RP0    RP1
> >>>       |
> >>>     switch
> >>>       |
> >>>  ----------------
> >>>  |    |    |    |
> >>> mem0 mem1 mem2 mem3
> >>>
> >>> For both topologies, I cannot reproduce the system panic shown in your cover
> >>> letter.
> >>>
> >>> btw, I tried compiling cxl both as modules and into the kernel.
> >>>
> >>> Below, I will use the direct-attached topology (a) as an example to show what I
> >>> tried. I hope to get some clarity about the test and what I missed or did wrong.
> >>>
> >>> -------------------------------------
> >>> pci device info on the test VM
> >>> root@fan:~# lspci
> >>> 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
> >>> 00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
> >>> 00:02.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
> >>> 00:03.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
> >>> 00:04.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
> >>> 00:05.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
> >>> 00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
> >>> 00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
> >>> 00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
> >>> 0c:00.0 PCI bridge: Intel Corporation Device 7075
> >>> 0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
> >>> root@fan:~#
> >>> -------------------------------------
> >>>
> >>> The aer injection input file looks like below,
> >>>
> >>> -------------------------------------
> >>> fan:~/cxl/cxl-test-tool$ cat /tmp/internal
> >>> AER
> >>> PCI_ID 0000:0c:00.0
> >>> UNCOR_STATUS INTERNAL
> >>> HEADER_LOG 0 1 2 3
> >>> ------------------------------------
> >>>
> >>> dmesg after aer injection
> >>>
> >>> ssh root@localhost -p 2024 "dmesg"
> >>> [ 613.195352] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> >>> [ 613.195830] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> >>> [ 613.196253] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> >>> [ 613.198199] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
> >>> -----------------------------------
> >> This is likely because the device's CXL RAS status is not set and as a result returns false and bypasses the panic.
> >> Unfortunately, the aer-inject only sets the AER status and triggers the interrupt. The CXL RAS is not set.
> >>
> >> I attached 2 'test' patches. The first patch sets the device's RAS status to simulate the error reporting.
> >> This will have to be adjusted as the patch looks for a specific device's bus and this will likely be a different
> >> bus than the devices you test in your setup.
> >>
> >> The 2nd patch enables UIE/CIE. I moved this out of the v2 patchset. I need to revisit this to see if it is
> >> needed in the patchset itself (not just a test patch).
> >>
> >> Regards,
> >> Terry
> >>
> > Hi Terry,
> >
> > I checked the two patches you attached. Do we really need the first
> > patch to unmask internal errors? I see they are already unmasked in
> > aer_enable_internal_errors(), which is called in aer_probe().
> > I tried applying only the other patch and testing again; the test
> > output seems to be the same as with both patches applied. The system panics as well.
> >
> > Fan
> Hi Fan,
>
> Which device did you inject into? RP, DSP, or USP?
>
> Yes, the RP UIE & CIE are enabled by the AER driver. RCEC too. But, this is not done for CXL DSP
> and USP. Below are details from the spec describing how an AER error masked at the source will not
> be propagated as notification to the root complex (RP or RCEC).
>
> 'If an individual error is masked when it is detected, its error status bit is still affected,
> but no error reporting Message is sent to the Root Complex, and the error is not recorded in the
> Header Log, TLP Prefix Log, or First Error Pointer.'[1]
>
> [1] PCIe Spec 6.2.3.2.2 Masking Individual Errors
>
> Also, there can be platform BIOS settings that enable/disable UIE/CIE.
>
> Regards,
> Terry

Oh, I see. I did inject into the RP in my previous setup, and I have
confirmed that we need the extra unmask for the downstream port case.
Thanks for the info.

Fan

-- 
Fan Ni
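
For anyone following the thread, the masking rule Terry quoted can be modeled in a few lines. This is only a toy sketch in Python, not kernel code: the Port class and its detect() method are hypothetical names made up for illustration, and the "masked at reset" default is an assumption of the model. The bit value is the one shown in the aer_inject log above (00400000) and matches PCI_ERR_UNC_INTN in pci_regs.h.

```python
# Toy model of AER mask semantics for an uncorrectable error; not kernel code.
# Port and detect() are hypothetical names used only for this illustration.
from dataclasses import dataclass

# Uncorrectable Internal Error (UIE) bit, as seen in the aer_inject log
# ("Injecting errors 00000000/00400000") and PCI_ERR_UNC_INTN in pci_regs.h.
UNC_INTERNAL = 0x00400000


@dataclass
class Port:
    name: str
    uncor_mask: int = UNC_INTERNAL  # assumption: UIE starts out masked
    uncor_status: int = 0

    def detect(self, err: int) -> bool:
        """Record an error; return True if a message would reach the root complex."""
        self.uncor_status |= err            # status bit is set even when masked
        return not (self.uncor_mask & err)  # a masked error sends no message


rp = Port("RP", uncor_mask=0)  # unmasked, as the AER driver does for root ports
dsp = Port("DSP")              # left masked, as in the downstream port case

print(rp.detect(UNC_INTERNAL))   # True: error message is sent
print(dsp.detect(UNC_INTERNAL))  # False: nothing reaches the root complex
print(hex(dsp.uncor_status))     # 0x400000: status is still latched
```

In this model the root port injection is reported while the downstream port injection only latches the status bit, which is consistent with the RP test working without the extra unmask patch.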