Re: [PATCH 2/2] drm/amdgpu: reset gpu for pm abort case

"Lazar, Lijo" <lijo.lazar@xxxxxxx> · Mon, 29 Jan 2024 12:18:16 +0530

On 1/26/2024 2:30 PM, Liang, Prike wrote:
>>
>> On 1/25/2024 8:52 AM, Prike Liang wrote:
>>> In the pm abort case the gfx power rail not turn off from FCH side and
>>> this will lead to the gfx reinitialized failed base on the unknown gfx
>>> HW status, so let's reset the gpu to a known good power state.
>>>
>>
>> From the description, this an APU only problem (or this patch could only
>> resolve APU abort sequence). However, there is no check for APU in the patch
>> below.
>>
> [Prike]  IIRC, there also has a similar problem on the dGPU side when suspend abort and
> now this patch is only drafted for a hot issue on the RV series. If need we can add a TODO
> item for drafting a more generic solution.
> 

If this addresses a specific issue, it would be better to check for the
specific IP revision rather than presenting the fix as a generic one. As
written, the patch logic applies to all soc15 ASICs.
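
Roughly, the narrowing could look like the standalone sketch below. It is
only a model of the decision, not driver code: struct model_dev, enum
model_chip and model_need_reset_on_abort() are hypothetical names, the chip
enum stands in for whatever IP revision check is appropriate, and picking
Raven assumes that is the part meant by "RV series". The remaining fields
mirror the ones used in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for driver state; not the real amdgpu structs. */
enum model_chip { MODEL_CHIP_VEGA10, MODEL_CHIP_RAVEN, MODEL_CHIP_RENOIR };

struct model_dev {
	enum model_chip chip; /* stand-in for an IP revision check */
	bool is_apu;          /* models an APU flag on the device */
	bool in_suspend;
	bool in_s0ix;
	bool pm_complete;     /* field name taken from the patch under review */
	uint32_t sol_reg;     /* models the MP0 C2PMSG_81 sign-of-life read */
};

/*
 * Request a reset after an aborted suspend only on the part the fix
 * targets, instead of on every soc15 ASIC.
 */
bool model_need_reset_on_abort(const struct model_dev *d)
{
	/* Assumption: the issue is limited to Raven APUs. */
	if (!d->is_apu || d->chip != MODEL_CHIP_RAVEN)
		return false;

	return d->in_suspend && !d->in_s0ix &&
	       !d->pm_complete && d->sol_reg != 0;
}
```

With this shape, a dGPU or a different APU in the same state returns false
and keeps the existing behaviour.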

>>
>>> Signed-off-by: Prike Liang <Prike.Liang@xxxxxxx>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
>>>  drivers/gpu/drm/amd/amdgpu/soc15.c         | 8 +++++++-
>>>  2 files changed, 12 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 56d9dfa61290..4c40ffaaa5c2 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -4627,6 +4627,11 @@ int amdgpu_device_resume(struct drm_device
>> *dev, bool fbcon)
>>>                     return r;
>>>     }
>>>
>>> +   if(amdgpu_asic_need_reset_on_init(adev)) {
>>> +           DRM_INFO("PM abort case and let's reset asic \n");
>>> +           amdgpu_asic_reset(adev);
>>> +   }
>>> +
>>
>> suspend_noirq is specific for suspend scenarios and not valid for freeze/thaw.
>> I guess this could trigger reset for successful restore on APUs.
>>
> [Prike] If doesn't run into noirq_suspend then still need further check whether the PSP TOS is still alive before gpu reset.
> 

AFAIU, after a successful resume from hibernate on APUs, the TOS will still
be running, so the patch will trigger a reset in those cases as well.
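
To illustrate the concern, here is a small standalone model of the condition
added to soc15_need_reset_on_init(). The struct and function names are
hypothetical, and the state values for the hibernate case are assumptions:
that pm_complete is only set on the suspend_noirq path, that in_suspend is
still set when resume runs, and that the sign-of-life register reads
nonzero because the TOS is alive.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the state the patch condition reads. */
struct model_pm_state {
	bool in_suspend;
	bool in_s0ix;
	bool pm_complete; /* assumed to be set only from the suspend_noirq path */
	uint32_t sol_reg; /* nonzero while the PSP TOS is alive */
};

/* Mirrors the condition the patch adds to soc15_need_reset_on_init(). */
bool model_patch_wants_reset(const struct model_pm_state *s)
{
	return s->in_suspend && !s->in_s0ix &&
	       !s->pm_complete && s->sol_reg != 0;
}

/* Assumed state on a successful restore from hibernate on an APU. */
struct model_pm_state model_hibernate_restore(void)
{
	struct model_pm_state s = {
		.in_suspend = true,
		.in_s0ix = false,
		.pm_complete = false, /* suspend_noirq is not part of this path */
		.sol_reg = 1,         /* TOS still running */
	};
	return s;
}
```

Under those assumptions model_patch_wants_reset() returns true for the
restore state, i.e. the condition cannot tell a successful restore apart
from an aborted suspend.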

Thanks,
Lijo

>>>     if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)
>>>             return 0;
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c
>>> b/drivers/gpu/drm/amd/amdgpu/soc15.c
>>> index 15033efec2ba..9329a00b6abc 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
>>> @@ -804,9 +804,16 @@ static bool soc15_need_reset_on_init(struct
>> amdgpu_device *adev)
>>>     if (adev->asic_type == CHIP_RENOIR)
>>>             return true;
>>>
>>> +   sol_reg = RREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_81);
>>> +
>>>     /* Just return false for soc15 GPUs.  Reset does not seem to
>>>      * be necessary.
>>>      */
>>
>> The comment now doesn't make sense.
>>
>> Thanks,
>> Lijo
>>
>>> +   if (adev->in_suspend && !adev->in_s0ix &&
>>> +                   !adev->pm_complete &&
>>> +                   sol_reg)
>>> +           return true;
>>> +
>>>     if (!amdgpu_passthrough(adev))
>>>             return false;
>>>
>>> @@ -816,7 +823,6 @@ static bool soc15_need_reset_on_init(struct
>> amdgpu_device *adev)
>>>     /* Check sOS sign of life register to confirm sys driver and sOS
>>>      * are already been loaded.
>>>      */
>>> -   sol_reg = RREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_81);
>>>     if (sol_reg)
>>>             return true;
>>>