Re: [PATCH 1/1] drm/amdgpu: Use device wedged event

Christian König <christian.koenig@xxxxxxx> · Mon, 16 Dec 2024 11:18:22 +0100

On 13.12.24 at 16:56, André Almeida wrote:
On 13/12/2024 11:36, Raag Jadav wrote:
On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
Hi Christian,

On 13/12/2024 04:34, Christian König wrote:
On 12.12.24 at 20:09, André Almeida wrote:
Use DRM's device wedged event to notify userspace that a reset has
happened. For now, only use the `none` method, meant for telemetry
capture.

Signed-off-by: André Almeida <andrealmeid@xxxxxxxxxx>
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
   1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 96316111300a..19e1a5493778 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
           dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
     atomic_set(&adev->reset_domain->reset_res, r);
+
+    drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);

That looks really good in general. I would just make the
DRM_WEDGE_RECOVERY_NONE depend on the value of "r".


Why depend on `r`? A reset was triggered anyway, regardless of whether
it succeeded; shouldn't we tell userspace?

A failed reset would perhaps result in wedging; at least that's how i915
is handling it.


Right, and I think this raises the question of which wedge recovery
method I should add for amdgpu... Christian?


In theory a rebind should be enough to get the device going again; our
BOCO does a bus reset on driver load anyway.
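
To make the idea concrete, here is a rough, untested sketch of picking the
recovery method from the reset result. It is standalone C, not kernel code:
the WEDGE_RECOVERY_* macros are local stand-ins for the DRM flags, and
wedge_method_for_reset() is a hypothetical helper name, not something that
exists in amdgpu. It assumes a successful reset (r == 0) only needs the
telemetry-only `none` method, while a failed reset asks for a rebind.

```c
#include <stdio.h>

/* Local stand-ins for the DRM wedge recovery flags (assumed for this sketch). */
#define WEDGE_RECOVERY_NONE    (1UL << 0) /* telemetry capture only */
#define WEDGE_RECOVERY_REBIND  (1UL << 1) /* unbind + bind the driver */

/*
 * Hypothetical helper: choose the recovery method from the reset result.
 * r == 0 means the reset succeeded, so userspace is only notified;
 * any non-zero r means the reset failed, so a rebind is requested.
 */
static unsigned long wedge_method_for_reset(int r)
{
	return r ? WEDGE_RECOVERY_REBIND : WEDGE_RECOVERY_NONE;
}
```

In amdgpu_device_gpu_recover() the result of such a helper would presumably
be passed to drm_dev_wedged_event() in place of the fixed
DRM_WEDGE_RECOVERY_NONE, but that wiring is an assumption on top of the patch
as posted.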

Regards,
Christian.