Am 2021-09-07 um 1:22 p.m. schrieb James Zhu: > > > On 2021-09-07 12:48 p.m., Felix Kuehling wrote: >> Am 2021-09-07 um 12:07 p.m. schrieb James Zhu: >>> Separate iommu_resume from kfd_resume, and move it before >>> other amdgpu ip init/resume. >>> >>> Fixed Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=211277 >> I think the change is OK. But I don't understand how the IOMMUv2 >> initialization sequence could affect a crash in DM. The display should >> not depend on IOMMUv2 at all. What am I missing? > > [JZ] It is a weird issue. disable VCN IP block or disable gpu_off > feature, or set pci=noats, all > > can fix DM crash. Also the issue occurred quite random, some time > after few suspend/resume cycle, > > some times after few hundreds S/R cycles. the maximum that I saw is > 2422 S/R cycles. > > But every time DM crash, I can see one or two iommu errors ahead: > > *AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=**** > flags=0x0070]* This error is not from IOMMUv2 doing GVA to GPA translations. It's from IOMMUv1 doing GPA to SPA translation. This error points to an invalid physical (GVA) address being used by the GPU to access random system memory it shouldn't be accessing (because there is no valid DMA mapping). On AMD systems, IOMMUv1 tends to be in pass-through mode when IOMMUv2 is enabled. It's possible that the earlier initialization of IOMMUv2 hides the problem by putting the IOMMU into passthrough mode. I don't think this patch series is a valid solution. You can probably fix the problem with this kernel boot parameter: iommu=pt And you can probably reproduce it even with this patch series if instead you add: iommu=nopt amd_iommu=force_isolation Regards, Felix > Since we can't stop HW/FW/SW right the way after IO page fault > detected, so I can't tell which part try to access > system memory through IOMMU. > > But after moving IOMMU device init before other amdgpu IP init/resume, > the DM crash /IOMMU page fault issues are gone. > > Those patches can't directly explain why the issue fixed, but this new > sequence makes more sense to me. > > Can I have you RB on those patches? > > Thanks! > James > >> Regards, >> Felix >> >> >>> Signed-off-by: James Zhu <James.Zhu@xxxxxxx> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++++++++ >>> 1 file changed, 12 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> index 653bd8f..e3f0308 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> @@ -2393,6 +2393,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev) >>> if (r) >>> goto init_failed; >>> >>> + r = amdgpu_amdkfd_resume_iommu(adev); >>> + if (r) >>> + goto init_failed; >>> + >>> r = amdgpu_device_ip_hw_init_phase1(adev); >>> if (r) >>> goto init_failed; >>> @@ -3147,6 +3151,10 @@ static int amdgpu_device_ip_resume(struct amdgpu_device *adev) >>> { >>> int r; >>> >>> + r = amdgpu_amdkfd_resume_iommu(adev); >>> + if (r) >>> + return r; >>> + >>> r = amdgpu_device_ip_resume_phase1(adev); >>> if (r) >>> return r; >>> @@ -4602,6 +4610,10 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle, >>> dev_warn(tmp_adev->dev, "asic atom init failed!"); >>> } else { >>> dev_info(tmp_adev->dev, "GPU reset succeeded, trying to resume\n"); >>> + r = amdgpu_amdkfd_resume_iommu(tmp_adev); >>> + if (r) >>> + goto out; >>> + >>> r = amdgpu_device_ip_resume_phase1(tmp_adev); >>> if (r) >>> goto out;