Re: [PATCH 3/3] drm/amdgpu: move iommu_resume before ip init/resume

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 2021-09-07 12:48 p.m., Felix Kuehling wrote:
Am 2021-09-07 um 12:07 p.m. schrieb James Zhu:
Separate iommu_resume from kfd_resume, and move it before
other amdgpu ip init/resume.

Fixed Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=211277
I think the change is OK. But I don't understand how the IOMMUv2
initialization sequence could affect a crash in DM. The display should
not depend on IOMMUv2 at all. What am I missing?

[JZ] It is a weird issue. disable VCN IP block or disable gpu_off feature, or set pci=noats, all

can fix DM crash. Also the issue occurred quite random, some time after few suspend/resume cycle,

some times after few hundreds S/R cycles. the maximum that I saw is 2422 S/R cycles.

But every time DM crash, I can see one or two iommu errors ahead:

AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=**** flags=0x0070]
Since we can't stop HW/FW/SW right the way after IO page fault detected, so I can't tell which part try to access
system memory through IOMMU.

But after moving IOMMU device init before other amdgpu IP init/resume, the DM crash /IOMMU page fault issues are gone.

Those patches can't directly explain why the issue fixed, but this new sequence makes more sense to me.

Can I have you RB on those patches?

Thanks!
James


Regards,
  Felix


Signed-off-by: James Zhu <James.Zhu@xxxxxxx>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 653bd8f..e3f0308 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2393,6 +2393,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
 	if (r)
 		goto init_failed;
 
+	r = amdgpu_amdkfd_resume_iommu(adev);
+	if (r)
+		goto init_failed;
+
 	r = amdgpu_device_ip_hw_init_phase1(adev);
 	if (r)
 		goto init_failed;
@@ -3147,6 +3151,10 @@ static int amdgpu_device_ip_resume(struct amdgpu_device *adev)
 {
 	int r;
 
+	r = amdgpu_amdkfd_resume_iommu(adev);
+	if (r)
+		return r;
+
 	r = amdgpu_device_ip_resume_phase1(adev);
 	if (r)
 		return r;
@@ -4602,6 +4610,10 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
 				dev_warn(tmp_adev->dev, "asic atom init failed!");
 			} else {
 				dev_info(tmp_adev->dev, "GPU reset succeeded, trying to resume\n");
+				r = amdgpu_amdkfd_resume_iommu(tmp_adev);
+				if (r)
+					goto out;
+
 				r = amdgpu_device_ip_resume_phase1(tmp_adev);
 				if (r)
 					goto out;

[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux