On 8/19/2022 12:35 PM, Christian König wrote:
Hi Bjorn,
Am 18.08.22 um 22:38 schrieb Bjorn Helgaas:
[Adding amdgpu folks]
On Wed, Aug 17, 2022 at 11:45:15PM +0000, bugzilla-daemon@xxxxxxxxxx wrote:
https://bugzilla.kernel.org/show_bug.cgi?id=216373
Bug ID: 216373
Summary: Uncorrected errors reported for AMD GPU
Kernel Version: v6.0-rc1
Regression: No
...
I marked this as a regression in bugzilla.
Hardware:
CPU: Intel i7-12700K (Alder Lake)
GPU: AMD RX 6700 XT [1002:73df]
Motherboard: ASUS Prime Z690-A
Problem:
After upgrading to v6.0-rc1 the kernel is now reporting uncorrected
PCI errors for my GPU.
Thank you very much for the report and for taking the trouble to
bisect it and test Kai-Heng's patch!
I suspect that booting with "pci=noaer" should be a temporary
workaround for this issue. If it is, can you add that to the bugzilla
for anybody else who trips over this?
I have bisected this issue to:
[8795e182b02dc87e343c79e73af6b8b7f9c5e635]
PCI/portdrv: Don't disable AER reporting in get_port_device_capability()
Reverting that commit causes the errors to cease.
I suspect the errors still occur, but we just don't notice and log
them.
I have also tried Kai-Heng Feng's patch[1] which seems to resolve a
similar problem, but it did not fix my issue.
[1] https://lore.kernel.org/linux-pci/20220706123244.18056-1-kai.heng.feng@canonical.com/
dmesg snippet:
pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
amdgpu 0000:03:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
amdgpu 0000:03:00.0: device [1002:73df] error status/mask=00100000/00000000
amdgpu 0000:03:00.0: [20] UnsupReq (First)
amdgpu 0000:03:00.0: AER: TLP Header: 40000001 0000000f 95e7f000 00000000
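As a rough illustration of how the status/mask pair maps to the "[20] UnsupReq" line, here is a minimal Python sketch. It assumes the bit positions of the PCIe AER Uncorrectable Error Status register. The name table is deliberately partial, and decode_uncor_status is a hypothetical helper, not a kernel function:

```python
# Partial map of bit positions in the PCIe AER Uncorrectable Error Status
# register to the short names seen in Linux AER logs (not exhaustive).
UNCOR_STATUS_BITS = {
    4: "DLP",
    12: "TLP",
    14: "CmpltTO",
    15: "CmpltAbrt",
    16: "UnxCmplt",
    18: "MalfTLP",
    20: "UnsupReq",
}


def decode_uncor_status(status, mask=0):
    """Return "[bit] name" strings for every status bit that is set and not masked."""
    active = status & ~mask
    decoded = []
    for bit in range(32):
        if active & (1 << bit):
            name = UNCOR_STATUS_BITS.get(bit, "unknown")
            decoded.append(f"[{bit}] {name}")
    return decoded


# Values taken from the "error status/mask=00100000/00000000" log line.
print(decode_uncor_status(0x00100000, 0x00000000))  # ['[20] UnsupReq']
```

With a mask of zero nothing is suppressed, so the single set bit (bit 20) is reported as an Unsupported Request.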
I think the TLP header decodes to:
0x40000001 = 0100 0000 ... 0000 0001 binary
0x0000000f = 0000 0000 ... 0000 1111 binary
Fmt            010b            3 DW header with data
Type           0 0000b         Fmt+Type 010 0 0000 = MWr (Memory Write Request)
Length         00 0000 0001b   1 DW
Requester ID   0x0000          00:00.0
Tag            0x00
Last DW BE     0000b           must be zero for 1 DW write
First DW BE    1111b           all 4 bytes in DW enabled
Address        0x95e7f000
Data           0x00000000
So I think this is a 32-bit write of zero to PCI bus address
0x95e7f000.
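The field extraction above can be reproduced with a short Python sketch. It assumes the standard layout of a memory request header with 32-bit addressing (Fmt in bits 31:29 and Type in bits 28:24 of DW0, Length in bits 9:0, Requester ID, Tag and byte enables in DW1). The helper names decode_tlp_header and format_bdf are made up for illustration:

```python
def decode_tlp_header(dw0, dw1, dw2):
    """Decode the first three DWs of a memory request TLP with a 32-bit address."""
    return {
        "fmt": (dw0 >> 29) & 0x7,            # 010b = 3 DW header, with data
        "type": (dw0 >> 24) & 0x1F,          # 0 0000b = memory request
        "length_dw": dw0 & 0x3FF,            # payload length in DWs
        "requester_id": (dw1 >> 16) & 0xFFFF,
        "tag": (dw1 >> 8) & 0xFF,
        "last_dw_be": (dw1 >> 4) & 0xF,
        "first_dw_be": dw1 & 0xF,
        "address": dw2 & 0xFFFFFFFC,         # low two bits are not address bits
    }


def format_bdf(requester_id):
    """Format a 16-bit Requester ID as bus:device.function."""
    bus = (requester_id >> 8) & 0xFF
    dev = (requester_id >> 3) & 0x1F
    fn = requester_id & 0x7
    return f"{bus:02x}:{dev:02x}.{fn}"


# DWs taken from "TLP Header: 40000001 0000000f 95e7f000 00000000".
hdr = decode_tlp_header(0x40000001, 0x0000000F, 0x95E7F000)
print(f"fmt={hdr['fmt']:03b}b type={hdr['type']:05b}b length={hdr['length_dw']} DW")
print(f"requester={format_bdf(hdr['requester_id'])} tag={hdr['tag']:#04x}")
print(f"last_be={hdr['last_dw_be']:04b}b first_be={hdr['first_dw_be']:04b}b")
print(f"address={hdr['address']:#010x}")
```

The fourth logged DW (00000000) is not part of a 3 DW header, which is why it is read as the write data.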
Your dmesg log says:
pci 0000:02:00.0: PCI bridge to [bus 03]
pci 0000:02:00.0: bridge window [mem 0x95e00000-0x95ffffff]
pci 0000:03:00.0: reg 0x24: [mem 0x95e00000-0x95efffff]
[drm] register mmio base: 0x95E00000
So this looks like a write to the device's BAR 5. I don't see a PCI
reason why this should fail. Maybe there's some amdgpu reason?
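The range check behind that conclusion is simple enough to sketch in Python. The ranges are copied from the dmesg lines above. The helper names in_range and bar_offset are hypothetical:

```python
# Ranges copied from the dmesg output (inclusive start and end).
BRIDGE_WINDOW = (0x95E00000, 0x95FFFFFF)   # pci 0000:02:00.0 bridge window
BAR5 = (0x95E00000, 0x95EFFFFF)            # pci 0000:03:00.0 reg 0x24
WRITE_ADDR = 0x95E7F000                    # address from the logged TLP header


def in_range(addr, rng):
    """True if addr lies inside the inclusive (start, end) range."""
    start, end = rng
    return start <= addr <= end


def bar_offset(addr, rng):
    """Offset of addr from the start of the range."""
    return addr - rng[0]


print(in_range(WRITE_ADDR, BRIDGE_WINDOW))   # True
print(in_range(WRITE_ADDR, BAR5))            # True
print(hex(bar_offset(WRITE_ADDR, BAR5)))     # 0x7f000
```

The address is forwarded by the bridge and claimed by BAR 5, at offset 0x7f000 from the MMIO base.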
Well, I have seen a couple of boards where stuff like that happened, but
in my experience there has always been some hardware problem behind it.
From my understanding, what essentially happens is that a write doesn't
make it to the device (e.g. because transmission errors can't be
corrected). Quite likely the write is then either dropped, which doesn't
matter much (just clearing the framebuffer, for example), or repeated,
and because of this everything still seems to work fine.
Either way, I suggest trying this with some other hardware
configuration, e.g. put the GPU in another system and see if it still
gives the same issues, or put another GPU into this system.
Or, it could be amdgpu or some other software component -
register mmio base: 0x95E00000
Address 0x95e7f000
0x95e7f000 indicates an access from the CPU to register offset 0x7F000
(0x95e7f000 - 0x95E00000). This doesn't look like a valid register
offset for this chip (device [1002:73df]). Any other clues in dmesg?
Thanks,
Lijo
Regards,
Christian.
Bjorn