On 8/25/2022 1:04 PM, Christian König wrote:
On 25.08.22 at 08:40, Stefan Roese wrote:
On 24.08.22 16:45, Tom Seewald wrote:
On Wed, Aug 24, 2022 at 12:11 AM Lazar, Lijo <lijo.lazar@xxxxxxx> wrote:
Unfortunately, I don't have any NV platforms to test. Attached is an
'untested-patch' based on your trace logs.
Thanks,
Lijo
Thank you for the patch. It applied cleanly to v6.0-rc2 and after
booting that kernel I no longer see any messages about PCI errors. I
have uploaded a dmesg log to the bug report:
https://bugzilla.kernel.org/attachment.cgi?id=301642
I did not follow this thread in depth, but FWICT the bug is solved now
with this patch. So is it correct that the now fully enabled AER
support in the PCI subsystem in v6.0 helped detect a bug in the AMD
GPU driver?
It looks like it, but I'm not 100% sure about the rationale behind it.
Lijo can you explain more on this?
From the trace, during gmc hw_init it takes this route -
gart_enable -> amdgpu_gtt_mgr_recover -> amdgpu_gart_invalidate_tlb ->
amdgpu_device_flush_hdp -> amdgpu_asic_flush_hdp (non-ring based HDP flush)
The HDP flush is done using the remapped offset, which is
MMIO_REG_HOLE_OFFSET (0x80000 - PAGE_SIZE):
WREG32_NO_KIQ((adev->rmmio_remap.reg_offset +
KFD_MMIO_REMAP_HDP_MEM_FLUSH_CNTL) >> 2, 0);
However, the remapping is not yet done at this point. It's done
later, during common block initialization. The access to the unmapped
offset '(0x80000 - PAGE_SIZE)' seems to come back as an Unsupported
Request, which is then reported through AER.
In the patch, I just moved the remapping before gmc block initialization.
Thanks,
Lijo
Thanks,
Christian.
Thanks,
Stefan
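
For readers skimming the thread, the ordering problem Lijo describes can be sketched as a small user-space simulation. This is not the amdgpu code or the patch: every sim_* name is hypothetical, PAGE_SIZE is assumed to be 4 KiB, and the flush register is assumed to sit at offset 0 of the remapped page. It only illustrates why a write to the remap hole before the remap is set up would be counted as an unsupported request, while the same write after the remap would not.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages; the thread only says "PAGE_SIZE". */
#define SIM_PAGE_SIZE            0x1000u
/* From the thread: MMIO_REG_HOLE_OFFSET is (0x80000 - PAGE_SIZE). */
#define SIM_MMIO_REG_HOLE_OFFSET (0x80000u - SIM_PAGE_SIZE)
/* Hypothetical: offset of the flush register inside the remapped page. */
#define SIM_HDP_MEM_FLUSH_CNTL   0x0u

/* Hypothetical stand-in for the device state relevant to this bug. */
struct sim_device {
    bool remap_done;                   /* has the HDP remap been programmed? */
    uint32_t remap_reg_offset;         /* byte offset of the remapped page */
    unsigned int unsupported_requests; /* accesses that hit the unmapped hole */
    unsigned int flushes;              /* accesses that reached a register */
};

/* Simulated register write; takes a dword index, like the macro in the thread. */
static void sim_wreg32(struct sim_device *dev, uint32_t dword_index,
                       uint32_t value)
{
    uint32_t byte_offset = dword_index << 2;
    bool in_hole = byte_offset >= SIM_MMIO_REG_HOLE_OFFSET &&
                   byte_offset < SIM_MMIO_REG_HOLE_OFFSET + SIM_PAGE_SIZE;

    (void)value; /* the written value does not matter for this sketch */

    if (in_hole && !dev->remap_done) {
        /* Nothing is mapped here yet: modelled as an unsupported request. */
        dev->unsupported_requests++;
        return;
    }
    dev->flushes++;
}

/* Stand-in for the remap step done during common block initialization. */
static void sim_remap_hdp(struct sim_device *dev)
{
    dev->remap_done = true;
}

/* Stand-in for the non-ring HDP flush reached from gmc hw_init. */
static void sim_flush_hdp(struct sim_device *dev)
{
    sim_wreg32(dev,
               (dev->remap_reg_offset + SIM_HDP_MEM_FLUSH_CNTL) >> 2, 0);
}

/* Runs the two init steps in either order; returns the error count. */
static unsigned int sim_init(bool remap_first)
{
    struct sim_device dev = { .remap_reg_offset = SIM_MMIO_REG_HOLE_OFFSET };

    if (remap_first)
        sim_remap_hdp(&dev); /* patched order: remap before gmc init */
    sim_flush_hdp(&dev);     /* gmc hw_init path that flushes HDP */
    if (!remap_first)
        sim_remap_hdp(&dev); /* original order: remap came too late */

    return dev.unsupported_requests;
}
```

Calling sim_init(false) models the original ordering (flush first, remap later), and sim_init(true) models the ordering after the patch. On real hardware the error is raised by the PCIe device and reported through AER, which a counter like this can only stand in for.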