On 2021-09-07 at 1:51 p.m., Felix Kuehling wrote:
> On 2021-09-07 at 1:22 p.m., James Zhu wrote:
>> On 2021-09-07 12:48 p.m., Felix Kuehling wrote:
>>> On 2021-09-07 at 12:07 p.m., James Zhu wrote:
>>>> Separate iommu_resume from kfd_resume, and move it before
>>>> other amdgpu ip init/resume.
>>>>
>>>> Fixed Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=211277
>>> I think the change is OK. But I don't understand how the IOMMUv2
>>> initialization sequence could affect a crash in DM. The display should
>>> not depend on IOMMUv2 at all. What am I missing?
>> [JZ] It is a weird issue. Disabling the VCN IP block, disabling the
>> gpu_off feature, or setting pci=noats can all fix the DM crash.
>>
>> Also, the issue occurred quite randomly: sometimes after a few
>> suspend/resume cycles, sometimes after a few hundred S/R cycles. The
>> maximum I saw was 2422 S/R cycles.
>>
>> But every time DM crashed, I could see one or two IOMMU errors ahead of it:
>>
>> *AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=****
>> flags=0x0070]*
> This error is not from IOMMUv2 doing GVA to GPA translations. It's from
> IOMMUv1 doing GPA to SPA translation. This error points to an invalid
> physical (GVA) address being used by the GPU to access random system

Oops: s/GVA/GPA

> memory it shouldn't be accessing (because there is no valid DMA mapping).
>
> On AMD systems, IOMMUv1 tends to be in pass-through mode when IOMMUv2 is
> enabled. It's possible that the earlier initialization of IOMMUv2 hides
> the problem by putting the IOMMU into pass-through mode. I don't think
> this patch series is a valid solution.
>
> You can probably fix the problem with this kernel boot parameter: iommu=pt
>
> And you can probably reproduce it even with this patch series if instead
> you add: iommu=nopt amd_iommu=force_isolation
>
> Regards,
>   Felix
>
>
>> Since we can't stop the HW/FW/SW right away after an IO page fault is
>> detected, I can't tell which part tried to access system memory through
>> the IOMMU.
>>
>> But after moving the IOMMU device init before the other amdgpu IP
>> init/resume, the DM crash / IOMMU page fault issues are gone.
>>
>> Those patches can't directly explain why the issue is fixed, but the new
>> sequence makes more sense to me.
>>
>> Can I have your RB on those patches?
>>
>> Thanks!
>> James
>>
>>> Regards,
>>>   Felix
>>>
>>>
>>>> Signed-off-by: James Zhu <James.Zhu@xxxxxxx>
>>>> ---
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++++++++
>>>>  1 file changed, 12 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 653bd8f..e3f0308 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -2393,6 +2393,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
>>>>  	if (r)
>>>>  		goto init_failed;
>>>>
>>>> +	r = amdgpu_amdkfd_resume_iommu(adev);
>>>> +	if (r)
>>>> +		goto init_failed;
>>>> +
>>>>  	r = amdgpu_device_ip_hw_init_phase1(adev);
>>>>  	if (r)
>>>>  		goto init_failed;
>>>> @@ -3147,6 +3151,10 @@ static int amdgpu_device_ip_resume(struct amdgpu_device *adev)
>>>>  {
>>>>  	int r;
>>>>
>>>> +	r = amdgpu_amdkfd_resume_iommu(adev);
>>>> +	if (r)
>>>> +		return r;
>>>> +
>>>>  	r = amdgpu_device_ip_resume_phase1(adev);
>>>>  	if (r)
>>>>  		return r;
>>>> @@ -4602,6 +4610,10 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>>>>  			dev_warn(tmp_adev->dev, "asic atom init failed!");
>>>>  		} else {
>>>>  			dev_info(tmp_adev->dev, "GPU reset succeeded, trying to resume\n");
>>>> +			r = amdgpu_amdkfd_resume_iommu(tmp_adev);
>>>> +			if (r)
>>>> +				goto out;
>>>> +
>>>>  			r = amdgpu_device_ip_resume_phase1(tmp_adev);
>>>>  			if (r)
>>>>  				goto out;
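
For reference, the hunks above only add calls to amdgpu_amdkfd_resume_iommu(); the helper itself is presumably introduced elsewhere in the series and is not shown in this patch. A minimal sketch of the expected shape, assuming it is a thin wrapper that forwards to a KFD-side resume hook when a KFD device exists (the field adev->kfd.dev and the hook name kgd2kfd_resume_iommu are assumptions here, not quoted from the patch):

int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev)
{
	int r = 0;

	/* Resume the IOMMUv2 device only if KFD is enabled for this GPU. */
	if (adev->kfd.dev)
		r = kgd2kfd_resume_iommu(adev->kfd.dev);

	return r;
}

With a wrapper of this shape, pulling the call ahead of amdgpu_device_ip_hw_init_phase1() / amdgpu_device_ip_resume_phase1() should only touch IOMMUv2 state, which is why Felix questions how the reordering could influence a DM crash at all.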