Re: [PATCH 1/1] drm/amdgpu: Use device wedged event

Raag Jadav <raag.jadav@xxxxxxxxx> · Mon, 16 Dec 2024 15:57:59 +0200

On Mon, Dec 16, 2024 at 10:15:00AM -0300, André Almeida wrote:
> On 16/12/2024 10:10, Christian König wrote:
> > On 16.12.24 at 14:04, André Almeida wrote:
> > > On 16/12/2024 07:38, Lazar, Lijo wrote:
> > > > 
> > > > 
> > > > On 12/16/2024 3:48 PM, Christian König wrote:
> > > > > On 13.12.24 at 16:56, André Almeida wrote:
> > > > > > On 13/12/2024 11:36, Raag Jadav wrote:
> > > > > > > On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
> > > > > > > > Hi Christian,
> > > > > > > > 
> > > > > > > > On 13/12/2024 04:34, Christian König wrote:
> > > > > > > > > On 12.12.24 at 20:09, André Almeida wrote:
> > > > > > > > > > Use DRM's device wedged event to notify userspace that a reset had
> > > > > > > > > > happened. For now, only use `none` method meant for telemetry
> > > > > > > > > > capture.
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: André Almeida <andrealmeid@xxxxxxxxxx>
> > > > > > > > > > ---
> > > > > > > > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
> > > > > > > > > >  1 file changed, 3 insertions(+)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > > > > index 96316111300a..19e1a5493778 100644
> > > > > > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > > > > @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> > > > > > > > > >          dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
> > > > > > > > > >      atomic_set(&adev->reset_domain->reset_res, r);
> > > > > > > > > > +
> > > > > > > > > > +    drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
> > > > > > > > > 
> > > > > > > > > That looks really good in general. I would just make the
> > > > > > > > > DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Why depend on `r`? A reset was triggered anyway, regardless of
> > > > > > > > whether it succeeded, so shouldn't we tell userspace?
> > > > > > > 
> > > > > > > A failed reset would perhaps result in wedging; at least that's
> > > > > > > how i915 is handling it.
> > > > > > > 
> > > > > > 
> > > > > > Right, and I think this raises the question of what wedge recovery
> > > > > > method I should add for amdgpu... Christian?
> > > > > > 
> > > > > 
> > > > > In theory a rebind should be enough to get the device going again, our
> > > > > BOCO does a bus reset on driver load anyway.
> > > > > 
> > > > 
> > > > The behavior varies between SOCs. In certain ones, if driver reset
> > > > fails, that means it's really in a bad state and it would need system
> > > > reboot.
> > > > 
> > > 
> > > Is this documented somewhere? Then I could even add a
> > > DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.

This was present in drafts v5 through v7 but was later dropped, on the
understanding that it is unwise to let a drm device make system-level
decisions. Instead, something like WEDGED=unknown lets the admin/user
decide how to deal with it.

https://patchwork.freedesktop.org/series/138069/
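
A minimal sketch of how a userspace consumer might act on such an event,
leaving the "unknown" case to the admin. It is not taken from the series.
It assumes the uevent carries a property of the form
WEDGED=<method>[,<method>...] with the method strings "none", "rebind",
"bus-reset" and "unknown"; the enum, parse_wedged() and wedge_policy() are
hypothetical names used only for illustration.

```c
#include <string.h>

/* Hypothetical userspace-side flags; illustrative names, not kernel API. */
enum wedge_method {
	WEDGE_NONE      = 1 << 0,
	WEDGE_REBIND    = 1 << 1,
	WEDGE_BUS_RESET = 1 << 2,
	WEDGE_UNKNOWN   = 1 << 3,
};

/*
 * Parse a uevent property of the assumed form "WEDGED=<m1>[,<m2>...]".
 * Returns a bitmask of wedge_method flags, or 0 if the string is not a
 * WEDGED property or lists no methods. Tokens that are not recognised
 * (including the literal "unknown") are folded into WEDGE_UNKNOWN.
 */
static unsigned int parse_wedged(const char *prop)
{
	static const char key[] = "WEDGED=";
	unsigned int mask = 0;
	const char *p;

	if (strncmp(prop, key, sizeof(key) - 1) != 0)
		return 0;

	p = prop + sizeof(key) - 1;
	while (*p) {
		size_t len = strcspn(p, ",");	/* length of the current token */

		if (len == 4 && strncmp(p, "none", 4) == 0)
			mask |= WEDGE_NONE;
		else if (len == 6 && strncmp(p, "rebind", 6) == 0)
			mask |= WEDGE_REBIND;
		else if (len == 9 && strncmp(p, "bus-reset", 9) == 0)
			mask |= WEDGE_BUS_RESET;
		else if (len > 0)
			mask |= WEDGE_UNKNOWN;

		p += len;
		if (*p == ',')
			p++;	/* skip the separator */
	}
	return mask;
}

/*
 * Example policy: anything unknown is deferred to the admin, otherwise
 * the least invasive advertised method is chosen.
 */
static const char *wedge_policy(unsigned int mask)
{
	if (mask == 0)
		return "ignore";
	if (mask & WEDGE_UNKNOWN)
		return "ask-admin";
	if (mask & WEDGE_REBIND)
		return "rebind";
	if (mask & WEDGE_BUS_RESET)
		return "bus-reset";
	return "collect-telemetry";
}
```

With this policy, wedge_policy(parse_wedged("WEDGED=unknown")) yields
"ask-admin", so no system-level decision is made on the device's behalf.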

> > Not publicly as far as I know. But indeed a driver reset has basically
> > the same chance of succeeding as a driver reload.
> > 
> > I think the use case we have here is more that the administrator
> > intentionally disabled the reset to allow HW investigation.
> > 
> > So far we did that with a rather broken "we don't do anything at all"
> > approach.
> 
> OK.
> 
> > 
> > > > I had asked earlier about the utility of this one here. If this is just
> > > > to inform userspace that driver has done a reset and recovered, it would
> > > > need some additional context also. We have a mechanism in KFD which
> > > > sends the context in which a reset has to be done. Currently, that's
> > > > restricted to compute applications, but if this is in a similar line, we
> > > > would like to pass some additional info like job timeout, RAS error etc.
> > > > 
> > > 
> > > DRM_WEDGE_RECOVERY_NONE is to inform userspace that the driver has done
> > > a reset and recovered, but additional data, like which job timed out,
> > > the RAS error and such, belongs in devcoredump I guess, where all
> > > data is gathered and collected later.
> > 
> > I think somebody else mentioned it as well that the source of the issue,
> > e.g. the PID of the submitting process would be helpful as well for
> > supervising daemons which need to restart processes when they caused
> > some issue.
> > 
> 
> It was me :) We have a use case where the daemon would indeed need the PID,
> but the daemon doesn't need to know the RAS error or the name of the job
> that timed out. There's no immediate action to be taken with that
> information, unlike the PID, which we do need to know.

I think this calls for the standardization of telemetry (devcoredump, syslog
etc) but since each driver has its own way of doing it, it'd be quite an uphill
battle.

Raag