Re: [Bug 216373] New: Uncorrected errors reported for AMD GPU

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Fri, 26 Aug 2022 09:10:50 +0200

On 25.08.22 at 19:48, Bjorn Helgaas wrote:
On Thu, Aug 25, 2022 at 10:18:28AM +0200, Christian König wrote:
On 25.08.22 at 09:54, Lazar, Lijo wrote:
On 8/25/2022 1:04 PM, Christian König wrote:
On 25.08.22 at 08:40, Stefan Roese wrote:
On 24.08.22 16:45, Tom Seewald wrote:
On Wed, Aug 24, 2022 at 12:11 AM Lazar, Lijo
<lijo.lazar@xxxxxxx> wrote:
Unfortunately, I don't have any NV platforms to test. Attached is an
'untested-patch' based on your trace logs.
...
I did not follow this thread in depth, but FWICT the bug is solved now
with this patch. So is it correct that the now fully enabled AER
support in the PCI subsystem in v6.0 helped detect a bug in the AMD
GPU driver?
It looks like it, but I'm not 100% sure about the rationale behind it.

Lijo can you explain more on this?
From the trace, during gmc hw_init it takes this route -

gart_enable -> amdgpu_gtt_mgr_recover -> amdgpu_gart_invalidate_tlb ->
amdgpu_device_flush_hdp -> amdgpu_asic_flush_hdp (non-ring based HDP
flush)

The HDP flush is done using the remapped offset, which is
MMIO_REG_HOLE_OFFSET (0x80000 - PAGE_SIZE):

WREG32_NO_KIQ((adev->rmmio_remap.reg_offset +
KFD_MMIO_REMAP_HDP_MEM_FLUSH_CNTL) >> 2, 0);

However, the remapping is not yet done at this point. It's done at a
later point during common block initialization. Access to the unmapped
offset '(0x80000 - PAGE_SIZE)' seems to come back as an Unsupported
Request and is reported through AER.
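
To make the offset arithmetic above concrete, here is a minimal sketch of how the dword register index passed to WREG32_NO_KIQ works out. It is not driver code: the SKETCH_* names and both helpers are hypothetical, PAGE_SIZE is assumed to be 4 KiB, and KFD_MMIO_REMAP_HDP_MEM_FLUSH_CNTL is assumed to be byte offset 0 within the remapped page.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages; the kernel uses its own PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096u

/* Value quoted in the thread: MMIO_REG_HOLE_OFFSET = 0x80000 - PAGE_SIZE. */
#define SKETCH_MMIO_REG_HOLE_OFFSET (0x80000u - SKETCH_PAGE_SIZE)

/* Assumption: the HDP memory-flush control is at byte 0 of the remapped page. */
#define SKETCH_HDP_MEM_FLUSH_CNTL 0u

/* Convert a byte offset into the dword index that WREG32-style macros take. */
static uint32_t sketch_reg_index(uint32_t byte_offset)
{
    assert((byte_offset & 3u) == 0); /* registers are 4-byte aligned */
    return byte_offset >> 2;
}

/* Dword index the non-ring HDP flush would write to under these assumptions. */
static uint32_t sketch_hdp_flush_index(void)
{
    return sketch_reg_index(SKETCH_MMIO_REG_HOLE_OFFSET +
                            SKETCH_HDP_MEM_FLUSH_CNTL);
}
```

Under these assumptions the write lands at byte offset 0x7F000 of the MMIO BAR, which is the address that is still unmapped when gmc hw_init runs.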
That's interesting behavior. So far AER always indicated some kind of
transmission error.

If that also happens on unmapped areas of the MMIO BAR, then we need to
keep that in mind.
AER can log many different kinds of errors, some related to hardware
issues and some related to software.

PCI writes are normally posted and get no response, so AER is the main
way to find out about writes to unimplemented addresses.

Reads do get a response, of course, and reads to unimplemented
addresses cause errors that most hardware turns into a ~0 data return
(in addition to reporting via AER if enabled).
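
As a rough illustration of the read case, a caller can treat an all-ones result as a hint that the read may have failed. This is only a sketch: the helper name is hypothetical, and a register can legitimately contain all ones, so the check cannot prove an error on its own.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if a 32-bit MMIO read came back as
 * all ones (~0), the value most hardware fabricates for a failed read.
 * A valid register may also hold ~0, so treat the result as a hint only.
 */
static bool sketch_read_looks_failed(uint32_t value)
{
    return value == UINT32_MAX;
}
```

Posted writes have no such in-band signal, which is why AER is the main way to notice them.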

The issue is that previous hardware generations reported this through a 
device specific interrupt.

It's nice to see that this is finally standardized. I'm just wondering 
if we could retire our hardware specific interrupt handler for this as well.

Christian.

Bjorn