Re: [PATCH 1/1] drm/amdgpu: Use device wedged event

"Lazar, Lijo" <lijo.lazar@xxxxxxx> · Mon, 16 Dec 2024 19:06:38 +0530

On 12/16/2024 6:45 PM, André Almeida wrote:
> 
> 
> Em 16/12/2024 10:10, Christian König escreveu:
>> Am 16.12.24 um 14:04 schrieb André Almeida:
>>> Em 16/12/2024 07:38, Lazar, Lijo escreveu:
>>>>
>>>>
>>>> On 12/16/2024 3:48 PM, Christian König wrote:
>>>>> Am 13.12.24 um 16:56 schrieb André Almeida:
>>>>>> Em 13/12/2024 11:36, Raag Jadav escreveu:
>>>>>>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>>>>>>> Hi Christian,
>>>>>>>>
>>>>>>>> Em 13/12/2024 04:34, Christian König escreveu:
>>>>>>>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>>>>>>>> Use DRM's device wedged event to notify userspace that a reset
>>>>>>>>>> had
>>>>>>>>>> happened. For now, only use `none` method meant for telemetry
>>>>>>>>>> capture.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: André Almeida <andrealmeid@xxxxxxxxxx>
>>>>>>>>>> ---
>>>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>>>>>>>     1 file changed, 3 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> index 96316111300a..19e1a5493778 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>             dev_info(adev->dev, "GPU reset end with ret =
>>>>>>>>>> %d\n", r);
>>>>>>>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>>>>>>>> +
>>>>>>>>>> +    drm_dev_wedged_event(adev_to_drm(adev),
>>>>>>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>>>>>>
>>>>>>>>> That looks really good in general. I would just make the
>>>>>>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>>>>>>
>>>>>>>>
>>>>>>>> Why depend or `r`? A reset was triggered anyway, regardless of the
>>>>>>>> success
>>>>>>>> of it, shouldn't we tell userspace?
>>>>>>>
>>>>>>> A failed reset would perhaps result in wedging, atleast that's
>>>>>>> how i915
>>>>>>> is handling it.
>>>>>>>
>>>>>>
>>>>>> Right, and I think this raises the question of what wedge recovery
>>>>>> method should I add for amdgpu... Christian?
>>>>>>
>>>>>
>>>>> In theory a rebind should be enough to get the device going again, our
>>>>> BOCO does a bus reset on driver load anyway.
>>>>>
>>>>
>>>> The behavior varies between SOCs. In certain ones, if driver reset
>>>> fails, that means it's really in a bad state and it would need system
>>>> reboot.
>>>>
>>>
>>> Is this documented somewhere? Then I could even add a
>>> DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.
>>
>> Not publicly as far as I know. But indeed a driver reset has basically
>> the same chance of succeeding than a driver reload.
>>
>> I think the use case we have here is more that the administrator
>> intentionally disabled the reset to allow HW investigation.
>>
>> So far we did that with a rather broken we don't do anything at all
>> approach.
> 
> OK.
> 
>>
>>>> I had asked earlier about the utility of this one here. If this is just
>>>> to inform userspace that driver has done a reset and recovered, it
>>>> would
>>>> need some additional context also. We have a mechanism in KFD which
>>>> sends the context in which a reset has to be done. Currently, that's
>>>> restricted to compute applications, but if this is in a similar
>>>> line, we
>>>> would like to pass some additional info like job timeout, RAS error
>>>> etc.
>>>>
>>>
>>> DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done a
>>> reset and recovered, but additional data about like which job
>>> timeout, RAS error and such belong to devcoredump I guess, where all
>>> data is gathered and collected later.
>>
>> I think somebody else mentioned it as well that the source of the
>> issue, e.g. the PID of the submitting process would be helpful as well
>> for supervising daemons which need to restart processes when they
>> caused some issue.
>>
> 
> It was me :) we have a use case that we would need the PID for the
> daemon indeed, but the daemon doesn't need to know what's the RAS error
> or the job name that timeouted, there's no immediate action to be taken
> with this information, contrary to the PID that we need to know.
> 

Regarding devcoredump - it's not done every time. For ex: RAS errors
have a different way to identify the source of error, hence we don't
need a coredump in such cases.

The intention is only to let the user know the reason for reset at a
high level, and probably add more things later like the engines or
queues that have reset etc.

Thanks,
Lijo

>> We just postponed adding that till later.
>>
>> Regards,
>> Christian.
>>
>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>
>>>
>>
>