On Fri, Aug 19, 2022 at 12:13:03PM -0500, Bjorn Helgaas wrote:
> On Thu, Aug 18, 2022 at 03:38:12PM -0500, Bjorn Helgaas wrote:
> > [Adding amdgpu folks]
> >
> > On Wed, Aug 17, 2022 at 11:45:15PM +0000, bugzilla-daemon@xxxxxxxxxx wrote:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=216373
> > >
> > >             Bug ID: 216373
> > >            Summary: Uncorrected errors reported for AMD GPU
> > >     Kernel Version: v6.0-rc1
> > >         Regression: No
>
> Tom, thanks for trying out "pci=noaer".  Hopefully we won't need the
> workaround for long.
>
> Could I trouble you to try the debug patch below and see if we get any
> stack trace clues in dmesg when the error happens?  I'm sure the
> experts would have a better approach, but I'm amdgpu-illiterate, so
> this is all I can do :)

Thanks for doing this, Tom!

For everybody else, Tom attached a dmesg log to the bugzilla:
https://bugzilla.kernel.org/attachment.cgi?id=301606

Lots of traces of the form:

  amdgpu_device_wreg.part.0.cold+0xb/0x17 [amdgpu]
  amdgpu_gart_invalidate_tlb+0x22/0x60 [amdgpu]
  gmc_v10_0_hw_init+0x44/0x180 [amdgpu]

  amdgpu_device_wreg.part.0.cold+0xb/0x17 [amdgpu]
  gmc_v10_0_hw_init+0xa8/0x180 [amdgpu]

  amdgpu_device_wreg.part.0.cold+0xb/0x17 [amdgpu]
  gmc_v10_0_flush_gpu_tlb+0x35/0x280 [amdgpu]
  amdgpu_gart_invalidate_tlb+0x46/0x60 [amdgpu]
  gmc_v10_0_hw_init+0x44/0x180 [amdgpu]

I tried connecting the dots but I gave up chasing all the function
pointers.