Re: [PATCH 1/1] drm/amdgpu: Use device wedged event

"Lazar, Lijo" <lijo.lazar@xxxxxxx> · Mon, 16 Dec 2024 16:08:09 +0530

On 12/16/2024 3:48 PM, Christian König wrote:
> Am 13.12.24 um 16:56 schrieb André Almeida:
>> Em 13/12/2024 11:36, Raag Jadav escreveu:
>>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>>> Hi Christian,
>>>>
>>>> Em 13/12/2024 04:34, Christian König escreveu:
>>>>> Am 12.12.24 um 20:09 schrieb André Almeida:
>>>>>> Use DRM's device wedged event to notify userspace that a reset had
>>>>>> happened. For now, only use `none` method meant for telemetry
>>>>>> capture.
>>>>>>
>>>>>> Signed-off-by: André Almeida <andrealmeid@xxxxxxxxxx>
>>>>>> ---
>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>>>    1 file changed, 3 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
>>>>>> index 96316111300a..19e1a5493778 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>>> amdgpu_device *adev,
>>>>>>            dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>>>>>> atomic_set(&adev->reset_domain->reset_res, r);
>>>>>> +
>>>>>> +    drm_dev_wedged_event(adev_to_drm(adev),
>>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>>
>>>>> That looks really good in general. I would just make the
>>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>>
>>>>
>>>> Why depend or `r`? A reset was triggered anyway, regardless of the
>>>> success
>>>> of it, shouldn't we tell userspace?
>>>
>>> A failed reset would perhaps result in wedging, atleast that's how i915
>>> is handling it.
>>>
>>
>> Right, and I think this raises the question of what wedge recovery
>> method should I add for amdgpu... Christian?
>>
> 
> In theory a rebind should be enough to get the device going again, our
> BOCO does a bus reset on driver load anyway.
> 

The behavior varies between SOCs. In certain ones, if driver reset
fails, that means it's really in a bad state and it would need system
reboot.

I had asked earlier about the utility of this one here. If this is just
to inform userspace that driver has done a reset and recovered, it would
need some additional context also. We have a mechanism in KFD which
sends the context in which a reset has to be done. Currently, that's
restricted to compute applications, but if this is in a similar line, we
would like to pass some additional info like job timeout, RAS error etc.

Thanks,
Lijo

> Regards,
> Christian.