Re: [PATCH 1/1] drm/amdgpu: Use device wedged event

Christian König <christian.koenig@xxxxxxx> · Mon, 16 Dec 2024 14:10:46 +0100

Am 16.12.24 um 14:04 schrieb André Almeida:
Em 16/12/2024 07:38, Lazar, Lijo escreveu:


On 12/16/2024 3:48 PM, Christian König wrote:
Am 13.12.24 um 16:56 schrieb André Almeida:
Em 13/12/2024 11:36, Raag Jadav escreveu:
On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
Hi Christian,

Em 13/12/2024 04:34, Christian König escreveu:
Am 12.12.24 um 20:09 schrieb André Almeida:
Use DRM's device wedged event to notify userspace that a reset had
happened. For now, only use `none` method meant for telemetry
capture.

Signed-off-by: André Almeida <andrealmeid@xxxxxxxxxx>
---
    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
    1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
index 96316111300a..19e1a5493778 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
amdgpu_device *adev,
            dev_info(adev->dev, "GPU reset end with ret = 
%d\n", r);
atomic_set(&adev->reset_domain->reset_res, r);
+
+    drm_dev_wedged_event(adev_to_drm(adev),
DRM_WEDGE_RECOVERY_NONE);

That looks really good in general. I would just make the
DRM_WEDGE_RECOVERY_NONE depend on the value of "r".


Why depend or `r`? A reset was triggered anyway, regardless of the
success
of it, shouldn't we tell userspace?

A failed reset would perhaps result in wedging, atleast that's how 
i915
is handling it.


Right, and I think this raises the question of what wedge recovery
method should I add for amdgpu... Christian?


In theory a rebind should be enough to get the device going again, our
BOCO does a bus reset on driver load anyway.


The behavior varies between SOCs. In certain ones, if driver reset
fails, that means it's really in a bad state and it would need system
reboot.


Is this documented somewhere? Then I could even add a 
DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.

Not publicly as far as I know. But indeed a driver reset has basically 
the same chance of succeeding than a driver reload.

I think the use case we have here is more that the administrator 
intentionally disabled the reset to allow HW investigation.

So far we did that with a rather broken we don't do anything at all 
approach.

I had asked earlier about the utility of this one here. If this is just
to inform userspace that driver has done a reset and recovered, it would
need some additional context also. We have a mechanism in KFD which
sends the context in which a reset has to be done. Currently, that's
restricted to compute applications, but if this is in a similar line, we
would like to pass some additional info like job timeout, RAS error etc.


DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done a 
reset and recovered, but additional data about like which job timeout, 
RAS error and such belong to devcoredump I guess, where all data is 
gathered and collected later.

I think somebody else mentioned it as well that the source of the issue, 
e.g. the PID of the submitting process would be helpful as well for 
supervising daemons which need to restart processes when they caused 
some issue.

We just postponed adding that till later.

Regards,
Christian.


Thanks,
Lijo

Regards,
Christian.