Re: [PATCH] drm/amdgpu: report bad status in GPU recovery

"Lazar, Lijo" <lijo.lazar@xxxxxxx> · Thu, 1 Aug 2024 12:11:04 +0530

On 8/1/2024 11:28 AM, Zhou1, Tao wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> Yes, the bad status message is printed twice with this patch. I think it's harmless and the second message is more convenient for customer.
> 
> I can add a parameter for amdgpu_ras_eeprom_check_err_threshold to disable the first message if you think printing message twice is not a good idea.
> 

Instead of this way, can't this be added to amdgpu_ras_do_recovery() and
stop all recovery actions?

Thanks,
Lijo

> Tao
> 
>> -----Original Message-----
>> From: Zhang, Hawking <Hawking.Zhang@xxxxxxx>
>> Sent: Thursday, August 1, 2024 1:30 PM
>> To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
>>
>> [AMD Official Use Only - AMD Internal Distribution Only]
>>
>> Right, it's functional. My concern is whether the kernel message in
>> amdgpu_ras_eeprom_check_err_threshold will be printed twice. This is the end
>> of gpu recovery (i.e., report gpu reset failed or gpu reset succeed).
>> Check_err_threshold was already done before reaching here.
>>
>> Regards,
>> Hawking
>>
>> -----Original Message-----
>> From: Zhou1, Tao <Tao.Zhou1@xxxxxxx>
>> Sent: Thursday, August 1, 2024 11:49
>> To: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
>>
>> [AMD Official Use Only - AMD Internal Distribution Only]
>>
>> I think the if condition in amdgpu_ras_eeprom_check_err_threshold is good
>> enough, no need to update it with is_rma.
>>
>> Tao
>>
>>> -----Original Message-----
>>> From: Zhang, Hawking <Hawking.Zhang@xxxxxxx>
>>> Sent: Thursday, August 1, 2024 11:00 AM
>>> To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> Cc: Zhou1, Tao <Tao.Zhou1@xxxxxxx>
>>> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
>>>
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>> Might consider leverage is_RMA flag for the same purpose?
>>>
>>> Regards,
>>> Hawking
>>>
>>> -----Original Message-----
>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Tao
>>> Zhou
>>> Sent: Wednesday, July 31, 2024 18:05
>>> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> Cc: Zhou1, Tao <Tao.Zhou1@xxxxxxx>
>>> Subject: [PATCH] drm/amdgpu: report bad status in GPU recovery
>>>
>>> Instead of printing GPU reset failed.
>>>
>>> Signed-off-by: Tao Zhou <tao.zhou1@xxxxxxx>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++--
>>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 355c2478c4b6..b7c967779b4b 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -5933,8 +5933,13 @@ int amdgpu_device_gpu_recover(struct
>>> amdgpu_device *adev,
>>>                 tmp_adev->asic_reset_res = 0;
>>>
>>>                 if (r) {
>>> -                       /* bad news, how to tell it to userspace ? */
>>> -                       dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
>>> atomic_read(&tmp_adev->gpu_reset_counter));
>>> +                       /* bad news, how to tell it to userspace ?
>>> +                        * for ras error, we should report GPU bad status instead of
>>> +                        * reset failure
>>> +                        */
>>> +                       if (!amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
>>> +                               dev_info(tmp_adev->dev, "GPU reset(%d)
>>> + failed\n",
>>> +
>>> + atomic_read(&tmp_adev->gpu_reset_counter));
>>>                         amdgpu_vf_error_put(tmp_adev,
>>> AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
>>>                 } else {
>>>                         dev_info(tmp_adev->dev, "GPU reset(%d)
>>> succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));
>>> --
>>> 2.34.1
>>>
>>
>>
>