On 2021-09-07 at 1:51 p.m., Felix Kuehling wrote:
> On 2021-09-07 at 1:22 p.m., James Zhu wrote:
>> On 2021-09-07 12:48 p.m., Felix Kuehling wrote:
>>> On 2021-09-07 at 12:07 p.m., James Zhu wrote:
>>>> Separate iommu_resume from kfd_resume, and move it before
>>>> other amdgpu ip init/resume.
>>>>
>>>> Fixed Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=211277
>>> I think the change is OK. But I don't understand how the IOMMUv2
>>> initialization sequence could affect a crash in DM. The display should
>>> not depend on IOMMUv2 at all. What am I missing?
>> [JZ] It is a weird issue. Disabling the VCN IP block, disabling the
>> gpu_off feature, or setting pci=noats can all fix the DM crash.
>>
>> Also, the issue occurred quite randomly: sometimes after a few
>> suspend/resume cycles, sometimes after a few hundred S/R cycles. The
>> maximum I saw was 2422 S/R cycles.
>>
>> But every time DM crashed, I could see one or two IOMMU errors ahead of it:
>>
>> *AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=****
>> flags=0x0070]*
> This error is not from IOMMUv2 doing GVA to GPA translations. It's from
> IOMMUv1 doing GPA to SPA translation. This error points to an invalid
> physical (GVA) address being used by the GPU to access random system

Oops: s/GVA/GPA

> memory it shouldn't be accessing (because there is no valid DMA mapping).
>
> On AMD systems, IOMMUv1 tends to be in pass-through mode when IOMMUv2 is
> enabled. It's possible that the earlier initialization of IOMMUv2 hides
> the problem by putting the IOMMU into pass-through mode. I don't think
> this patch series is a valid solution.
>
> You can probably fix the problem with this kernel boot parameter: iommu=pt
>
> And you can probably reproduce it even with this patch series if instead
> you add: iommu=nopt amd_iommu=force_isolation
>
> Regards,
>   Felix
>
>
>> Since we can't stop the HW/FW/SW right away after an IO page fault is
>> detected, I can't tell which part tried to access system memory through
>> the IOMMU.
>>
>> But after moving the IOMMU device init before the other amdgpu IP
>> init/resume, the DM crash / IOMMU page fault issues are gone.
>>
>> Those patches can't directly explain why the issue is fixed, but the new
>> sequence makes more sense to me.
>>
>> Can I have your RB on those patches?
>>
>> Thanks!
>> James
>>
>>> Regards,
>>>   Felix
>>>
>>>
>>>> Signed-off-by: James Zhu <James.Zhu@xxxxxxx>
>>>> ---
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++++++++
>>>>  1 file changed, 12 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 653bd8f..e3f0308 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -2393,6 +2393,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
>>>>  	if (r)
>>>>  		goto init_failed;
>>>>
>>>> +	r = amdgpu_amdkfd_resume_iommu(adev);
>>>> +	if (r)
>>>> +		goto init_failed;
>>>> +
>>>>  	r = amdgpu_device_ip_hw_init_phase1(adev);
>>>>  	if (r)
>>>>  		goto init_failed;
>>>> @@ -3147,6 +3151,10 @@ static int amdgpu_device_ip_resume(struct amdgpu_device *adev)
>>>>  {
>>>>  	int r;
>>>>
>>>> +	r = amdgpu_amdkfd_resume_iommu(adev);
>>>> +	if (r)
>>>> +		return r;
>>>> +
>>>>  	r = amdgpu_device_ip_resume_phase1(adev);
>>>>  	if (r)
>>>>  		return r;
>>>> @@ -4602,6 +4610,10 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>>>>  			dev_warn(tmp_adev->dev, "asic atom init failed!");
>>>>  		} else {
>>>>  			dev_info(tmp_adev->dev, "GPU reset succeeded, trying to resume\n");
>>>> +			r = amdgpu_amdkfd_resume_iommu(tmp_adev);
>>>> +			if (r)
>>>> +				goto out;
>>>> +
>>>>  			r = amdgpu_device_ip_resume_phase1(tmp_adev);
>>>>  			if (r)
>>>>  				goto out;
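
For reference, the hunks above only add calls to amdgpu_amdkfd_resume_iommu(); the helper itself is presumably introduced elsewhere in the series and is not shown in this patch. A minimal sketch of the expected shape, assuming it is a thin wrapper that forwards to a KFD-side resume hook when a KFD device exists (the field adev->kfd.dev and the hook name kgd2kfd_resume_iommu are assumptions here, not quoted from the patch):

int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev)
{
	int r = 0;

	/* Resume the IOMMUv2 device only if KFD is enabled for this GPU. */
	if (adev->kfd.dev)
		r = kgd2kfd_resume_iommu(adev->kfd.dev);

	return r;
}

With a wrapper of this shape, pulling the call ahead of amdgpu_device_ip_hw_init_phase1() / amdgpu_device_ip_resume_phase1() should only touch IOMMUv2 state, which is why Felix questions how the reordering could influence a DM crash at all.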