On 2021-09-07 12:48 p.m., Felix Kuehling wrote:
On 2021-09-07 at 12:07 p.m., James Zhu wrote:
Separate iommu_resume from kfd_resume, and move it before other amdgpu ip init/resume.

Fixed Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=211277

I think the change is OK. But I don't understand how the IOMMUv2 initialization sequence could affect a crash in DM. The display should not depend on IOMMUv2 at all. What am I missing?
[JZ] It is a weird issue. Disabling the VCN IP block, disabling the gpu_off feature, or setting pci=noats
can each fix the DM crash. The issue also occurs quite randomly: sometimes after a few suspend/resume
cycles, sometimes after a few hundred S/R cycles. The most I have seen before a crash is 2422 S/R cycles.
But every time DM crashes, I can see one or two IOMMU errors just before it:
AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=**** flags=0x0070]
Since we can't stop HW/FW/SW right away after an IO page fault is detected,
I can't tell which part tries to access system memory through the IOMMU.
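
For anyone repeating the S/R cycling, a small matcher for the AMD-Vi event can flag which cycle logged a fault. This is only a sketch: parse_io_page_fault() is a hypothetical userspace helper, not part of the kernel or of this patch, and it assumes the whole event sits on one line of the log, which is how dmesg normally prints it.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical userspace helper (an assumption for illustration only).
 * Returns 1 if the log line contains an AMD-Vi IO_PAGE_FAULT event,
 * 0 otherwise. When the domain= and flags= fields are present, their
 * hex values are stored through the non-NULL output pointers.
 */
int parse_io_page_fault(const char *line, unsigned int *domain,
			unsigned int *flags)
{
	const char *p;

	if (!strstr(line, "IO_PAGE_FAULT"))
		return 0;

	/* %x accepts an optional "0x" prefix, so "domain=0x0000" parses. */
	p = strstr(line, "domain=");
	if (p && domain)
		sscanf(p, "domain=%x", domain);

	p = strstr(line, "flags=");
	if (p && flags)
		sscanf(p, "flags=%x", flags);

	return 1;
}
```

Fed each line of the kernel log after every resume, it would show whether a page fault was logged before DM went down, without telling us which block issued the access.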
But after moving IOMMU device init before other amdgpu IP init/resume,
the DM crash / IOMMU page fault issues are gone.
Those patches can't directly explain why the issue is fixed,
but this new sequence makes more sense to me.
Can I have your RB on those patches?
Thanks!
James
Regards,
  Felix

Signed-off-by: James Zhu <James.Zhu@xxxxxxx>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 653bd8f..e3f0308 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2393,6 +2393,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
 	if (r)
 		goto init_failed;
 
+	r = amdgpu_amdkfd_resume_iommu(adev);
+	if (r)
+		goto init_failed;
+
 	r = amdgpu_device_ip_hw_init_phase1(adev);
 	if (r)
 		goto init_failed;
@@ -3147,6 +3151,10 @@ static int amdgpu_device_ip_resume(struct amdgpu_device *adev)
 {
 	int r;
 
+	r = amdgpu_amdkfd_resume_iommu(adev);
+	if (r)
+		return r;
+
 	r = amdgpu_device_ip_resume_phase1(adev);
 	if (r)
 		return r;
@@ -4602,6 +4610,10 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
 				dev_warn(tmp_adev->dev, "asic atom init failed!");
 			} else {
 				dev_info(tmp_adev->dev, "GPU reset succeeded, trying to resume\n");
+				r = amdgpu_amdkfd_resume_iommu(tmp_adev);
+				if (r)
+					goto out;
+
 				r = amdgpu_device_ip_resume_phase1(tmp_adev);
 				if (r)
 					goto out;
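
The ordering the patch sets up in amdgpu_device_ip_resume() can be modeled outside the kernel to check that the IOMMU step runs first and that a failure there stops the rest. This is a minimal sketch only: struct fake_dev and all fake_* functions are hypothetical stand-ins, not driver code, and the final "rest" step is an assumed placeholder for whatever follows phase 1.

```c
/* Step identifiers recorded by the model; names are illustrative only. */
enum step_id { STEP_IOMMU = 1, STEP_PHASE1, STEP_REST };

/* Hypothetical stand-in for the device: it only records call order. */
struct fake_dev {
	int order[4];	/* step ids in the order they ran */
	int count;	/* number of steps that ran */
	int fail_at;	/* step id that should fail, 0 for none */
};

/* Record that a step ran; return -1 if it is the one set to fail. */
static int record(struct fake_dev *d, int id)
{
	d->order[d->count++] = id;
	return d->fail_at == id ? -1 : 0;
}

int fake_resume_iommu(struct fake_dev *d)  { return record(d, STEP_IOMMU); }
int fake_resume_phase1(struct fake_dev *d) { return record(d, STEP_PHASE1); }
int fake_resume_rest(struct fake_dev *d)   { return record(d, STEP_REST); }

/*
 * Same shape as the patched resume path: IOMMU first, then phase 1,
 * returning early on the first error ("if (r) return r;").
 */
int fake_device_ip_resume(struct fake_dev *d)
{
	int r;

	r = fake_resume_iommu(d);
	if (r)
		return r;

	r = fake_resume_phase1(d);
	if (r)
		return r;

	return fake_resume_rest(d);
}
```

With fail_at set to STEP_IOMMU, only one step is recorded, which matches the intent that no other IP block is resumed until the IOMMU device is usable.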