[Adding amdgpu folks]

On Wed, Aug 17, 2022 at 11:45:15PM +0000, bugzilla-daemon@xxxxxxxxxx wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=216373
>
>             Bug ID: 216373
>            Summary: Uncorrected errors reported for AMD GPU
>     Kernel Version: v6.0-rc1
>         Regression: No
> ...

I marked this as a regression in bugzilla.

> Hardware:
> CPU: Intel i7-12700K (Alder Lake)
> GPU: AMD RX 6700 XT [1002:73df]
> Motherboard: ASUS Prime Z690-A
>
> Problem:
> After upgrading to v6.0-rc1 the kernel is now reporting uncorrected PCI errors
> for my GPU.

Thank you very much for the report and for taking the trouble to
bisect it and test Kai-Heng's patch!

I suspect that booting with "pci=noaer" should be a temporary
workaround for this issue.  If it is, can you add that to the bugzilla
for anybody else who trips over this?

> I have bisected this issue to: [8795e182b02dc87e343c79e73af6b8b7f9c5e635]
> PCI/portdrv: Don't disable AER reporting in get_port_device_capability()
> Reverting that commit causes the errors to cease.

I suspect the errors still occur with that commit reverted, but we
just don't notice and log them.

> I have also tried Kai-Heng Feng's patch[1] which seems to resolve a similar
> problem, but it did not fix my issue.
>
> [1]
> https://lore.kernel.org/linux-pci/20220706123244.18056-1-kai.heng.feng@xxxxxxxxxxxxx/
>
> dmesg snippet:
>
> pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received:
> 0000:03:00.0
> amdgpu 0000:03:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal),
> type=Transaction Layer, (Requester ID)
> amdgpu 0000:03:00.0:   device [1002:73df] error status/mask=00100000/00000000
> amdgpu 0000:03:00.0:    [20] UnsupReq               (First)
> amdgpu 0000:03:00.0: AER:   TLP Header: 40000001 0000000f 95e7f000 00000000

I think the TLP header decodes to:

  0x40000001 = 0100 0000 ... 0000 0001 binary
  0x0000000f = 0000 0000 ... 0000 1111 binary

  Fmt           010b                 3 DW header with data
  Type          0 0000b              Fmt/Type 010 0 0000 = MWr, Memory Write Request
  Length        00 0000 0001b        1 DW
  Requester ID  0x0000               00:00.0
  Tag           0x00
  Last DW BE    0000b                must be zero for 1 DW write
  First DW BE   1111b                all 4 bytes in DW enabled
  Address       0x95e7f000
  Data          0x00000000

So I think this is a 32-bit write of zero to PCI bus address
0x95e7f000.

Your dmesg log says:

  pci 0000:02:00.0: PCI bridge to [bus 03]
  pci 0000:02:00.0:   bridge window [mem 0x95e00000-0x95ffffff]
  pci 0000:03:00.0: reg 0x24: [mem 0x95e00000-0x95efffff]
  [drm] register mmio base: 0x95E00000

So this looks like a write to the device's BAR 5 (reg 0x24), at
offset 0x7f000 from the start of the BAR.  I don't see a PCI reason
why this should fail.  Maybe there's some amdgpu reason?

Bjorn
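
For anyone who wants to repeat the decode above on another AER "TLP
Header" line, here is a minimal Python sketch.  It is only an
illustration, not kernel code: decode_tlp_header(), in_window() and the
dictionary keys are made-up names, and it assumes the logged header is
a memory request in the usual (non-flit) layout with the four DWs in
the order dmesg prints them.

```python
def decode_tlp_header(dw0, dw1, dw2, dw3):
    """Decode the four logged DWs, assuming a memory request TLP."""
    fmt = (dw0 >> 29) & 0x7         # Fmt: bit 1 = with data, bit 0 = 4 DW header
    tlp_type = (dw0 >> 24) & 0x1f   # Type, 5 bits
    length = dw0 & 0x3ff            # payload length in DW (0 encodes 1024)
    req_id = (dw1 >> 16) & 0xffff   # Requester ID: bus[15:8] dev[7:3] fn[2:0]
    if fmt & 0x1:
        # 4 DW header: 64-bit address, upper half in DW2, lower half in DW3
        address = (dw2 << 32) | (dw3 & 0xfffffffc)
    else:
        # 3 DW header: 32-bit address in DW2, low two bits are not address bits
        address = dw2 & 0xfffffffc
    return {
        "fmt": fmt,
        "type": tlp_type,
        "length_dw": length,
        "requester": "%02x:%02x.%d" % (req_id >> 8, (req_id >> 3) & 0x1f, req_id & 0x7),
        "tag": (dw1 >> 8) & 0xff,
        "last_be": (dw1 >> 4) & 0xf,
        "first_be": dw1 & 0xf,
        "address": address,
    }


def in_window(address, start, end):
    """True if address falls inside the inclusive [start, end] range."""
    return start <= address <= end


# Values taken from the dmesg snippet quoted in this thread.
hdr = decode_tlp_header(0x40000001, 0x0000000f, 0x95e7f000, 0x00000000)
bar5_start, bar5_end = 0x95e00000, 0x95efffff
print(hdr)
print("inside BAR 5:", in_window(hdr["address"], bar5_start, bar5_end))
print("offset into BAR 5: 0x%x" % (hdr["address"] - bar5_start))
```

The sketch only looks at the fields discussed above; it does not
validate the Fmt/Type combination or handle TLP prefixes.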