On 12/16/2024 3:48 PM, Christian König wrote:
> On 13.12.24 at 16:56, André Almeida wrote:
>> On 13/12/2024 11:36, Raag Jadav wrote:
>>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
>>>> Hi Christian,
>>>>
>>>> On 13/12/2024 04:34, Christian König wrote:
>>>>> On 12.12.24 at 20:09, André Almeida wrote:
>>>>>> Use DRM's device wedged event to notify userspace that a reset had
>>>>>> happened. For now, only use `none` method meant for telemetry
>>>>>> capture.
>>>>>>
>>>>>> Signed-off-by: André Almeida <andrealmeid@xxxxxxxxxx>
>>>>>> ---
>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>>>>>  1 file changed, 3 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> index 96316111300a..19e1a5493778 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
>>>>>> amdgpu_device *adev,
>>>>>>          dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>>>>>>      atomic_set(&adev->reset_domain->reset_res, r);
>>>>>> +
>>>>>> +    drm_dev_wedged_event(adev_to_drm(adev),
>>>>>> DRM_WEDGE_RECOVERY_NONE);
>>>>>
>>>>> That looks really good in general. I would just make the
>>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>>>>>
>>>>
>>>> Why depend on `r`? A reset was triggered anyway, regardless of the
>>>> success
>>>> of it, shouldn't we tell userspace?
>>>
>>> A failed reset would perhaps result in wedging, at least that's how i915
>>> is handling it.
>>>
>>
>> Right, and I think this raises the question of what wedge recovery
>> method should I add for amdgpu... Christian?
>>
>
> In theory a rebind should be enough to get the device going again, our
> BOCO does a bus reset on driver load anyway.
>

The behavior varies between SoCs. On certain ones, a failed driver reset
means the device is really in a bad state and needs a system reboot.

I had asked earlier about the utility of this event here. If it is only
meant to inform userspace that the driver has done a reset and
recovered, it would also need some additional context. We have a
mechanism in KFD which sends the context in which a reset has to be
done. Currently, that's restricted to compute applications, but if this
is along similar lines, we would like to pass some additional info such
as job timeout, RAS error, etc.

Thanks,
Lijo

> Regards,
> Christian.
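
For illustration only, the suggestion in the thread to make the recovery method depend on `r` could be sketched roughly as below. This is a standalone userspace sketch, not the amdgpu change: the `SKETCH_*` constants and `sketch_pick_wedge_method()` are hypothetical stand-ins, their values are assumptions, and mapping a failed reset to a rebind simply follows the "in theory a rebind should be enough" remark. It does not cover the SoCs where a failed reset needs a system reboot, nor the extra context (job timeout, RAS error) asked for above.

```c
#include <assert.h>

/*
 * Local stand-ins for illustration only. The names echo the wedge
 * recovery methods discussed in the thread, but these are NOT the
 * kernel's definitions and the values are assumptions.
 */
#define SKETCH_WEDGE_RECOVERY_NONE   (1UL << 0) /* telemetry only */
#define SKETCH_WEDGE_RECOVERY_REBIND (1UL << 1) /* unbind and rebind the driver */

/*
 * Hypothetical helper: choose a recovery method from the reset result
 * `r` (0 on success, a negative errno-style value on failure).
 */
unsigned long sketch_pick_wedge_method(int r)
{
	if (r == 0)
		return SKETCH_WEDGE_RECOVERY_NONE;

	/* Reset failed: ask userspace to rebind (assumed policy). */
	return SKETCH_WEDGE_RECOVERY_REBIND;
}

/* Small self-check showing the intended mapping. */
void sketch_selfcheck(void)
{
	assert(sketch_pick_wedge_method(0) == SKETCH_WEDGE_RECOVERY_NONE);
	assert(sketch_pick_wedge_method(-5) == SKETCH_WEDGE_RECOVERY_REBIND);
}
```

In the driver, the value chosen this way would presumably be passed as the second argument of the `drm_dev_wedged_event()` call shown in the quoted patch, in place of the fixed `DRM_WEDGE_RECOVERY_NONE`.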