On Mon, Dec 16, 2024 at 10:15:00AM -0300, André Almeida wrote: > Em 16/12/2024 10:10, Christian König escreveu: > > Am 16.12.24 um 14:04 schrieb André Almeida: > > > Em 16/12/2024 07:38, Lazar, Lijo escreveu: > > > > > > > > > > > > On 12/16/2024 3:48 PM, Christian König wrote: > > > > > Am 13.12.24 um 16:56 schrieb André Almeida: > > > > > > Em 13/12/2024 11:36, Raag Jadav escreveu: > > > > > > > On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote: > > > > > > > > Hi Christian, > > > > > > > > > > > > > > > > Em 13/12/2024 04:34, Christian König escreveu: > > > > > > > > > Am 12.12.24 um 20:09 schrieb André Almeida: > > > > > > > > > > Use DRM's device wedged event to notify userspace that a reset had > > > > > > > > > > happened. For now, only use `none` method meant for telemetry > > > > > > > > > > capture. > > > > > > > > > > > > > > > > > > > > Signed-off-by: André Almeida <andrealmeid@xxxxxxxxxx> > > > > > > > > > > --- > > > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++ > > > > > > > > > > 1 file changed, 3 insertions(+) > > > > > > > > > > > > > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > > > > > > > > b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c > > > > > > > > > > index 96316111300a..19e1a5493778 100644 > > > > > > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > > > > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > > > > > > > > @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct > > > > > > > > > > amdgpu_device *adev, > > > > > > > > > > dev_info(adev->dev, "GPU > > > > > > > > > > reset end with ret = %d\n", r); > > > > > > > > > > atomic_set(&adev->reset_domain->reset_res, r); > > > > > > > > > > + > > > > > > > > > > + drm_dev_wedged_event(adev_to_drm(adev), > > > > > > > > > > DRM_WEDGE_RECOVERY_NONE); > > > > > > > > > > > > > > > > > > That looks really good in general. I would just make the > > > > > > > > > DRM_WEDGE_RECOVERY_NONE depend on the value of "r". > > > > > > > > > > > > > > > > > > > > > > > > > Why depend or `r`? A reset was triggered anyway, regardless of the > > > > > > > > success > > > > > > > > of it, shouldn't we tell userspace? > > > > > > > > > > > > > > A failed reset would perhaps result in wedging, > > > > > > > atleast that's how i915 > > > > > > > is handling it. > > > > > > > > > > > > > > > > > > > Right, and I think this raises the question of what wedge recovery > > > > > > method should I add for amdgpu... Christian? > > > > > > > > > > > > > > > > In theory a rebind should be enough to get the device going again, our > > > > > BOCO does a bus reset on driver load anyway. > > > > > > > > > > > > > The behavior varies between SOCs. In certain ones, if driver reset > > > > fails, that means it's really in a bad state and it would need system > > > > reboot. > > > > > > > > > > Is this documented somewhere? Then I could even add a > > > DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario. This was present in drafts v5 through v7 but later dropped with the understanding that it is unwise to let a drm device make system level decisions and rather have something like WEDGED=unknown to let admin/user decide how to deal with it. https://patchwork.freedesktop.org/series/138069/ > > Not publicly as far as I know. But indeed a driver reset has basically > > the same chance of succeeding than a driver reload. > > > > I think the use case we have here is more that the administrator > > intentionally disabled the reset to allow HW investigation. > > > > So far we did that with a rather broken we don't do anything at all > > approach. > > OK. > > > > > > > I had asked earlier about the utility of this one here. If this is just > > > > to inform userspace that driver has done a reset and recovered, it would > > > > need some additional context also. We have a mechanism in KFD which > > > > sends the context in which a reset has to be done. Currently, that's > > > > restricted to compute applications, but if this is in a similar line, we > > > > would like to pass some additional info like job timeout, RAS error etc. > > > > > > > > > > DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done > > > a reset and recovered, but additional data about like which job > > > timeout, RAS error and such belong to devcoredump I guess, where all > > > data is gathered and collected later. > > > > I think somebody else mentioned it as well that the source of the issue, > > e.g. the PID of the submitting process would be helpful as well for > > supervising daemons which need to restart processes when they caused > > some issue. > > > > It was me :) we have a use case that we would need the PID for the daemon > indeed, but the daemon doesn't need to know what's the RAS error or the job > name that timeouted, there's no immediate action to be taken with this > information, contrary to the PID that we need to know. I think this calls for the standardization of telemetry (devcoredump, syslog etc) but since each driver has its own way of doing it, it'd be quite an uphill battle. Raag