On 12/16/2024 6:45 PM, André Almeida wrote: > > > Em 16/12/2024 10:10, Christian König escreveu: >> Am 16.12.24 um 14:04 schrieb André Almeida: >>> Em 16/12/2024 07:38, Lazar, Lijo escreveu: >>>> >>>> >>>> On 12/16/2024 3:48 PM, Christian König wrote: >>>>> Am 13.12.24 um 16:56 schrieb André Almeida: >>>>>> Em 13/12/2024 11:36, Raag Jadav escreveu: >>>>>>> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote: >>>>>>>> Hi Christian, >>>>>>>> >>>>>>>> Em 13/12/2024 04:34, Christian König escreveu: >>>>>>>>> Am 12.12.24 um 20:09 schrieb André Almeida: >>>>>>>>>> Use DRM's device wedged event to notify userspace that a reset >>>>>>>>>> had >>>>>>>>>> happened. For now, only use `none` method meant for telemetry >>>>>>>>>> capture. >>>>>>>>>> >>>>>>>>>> Signed-off-by: André Almeida <andrealmeid@xxxxxxxxxx> >>>>>>>>>> --- >>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++ >>>>>>>>>> 1 file changed, 3 insertions(+) >>>>>>>>>> >>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>>>>> b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c >>>>>>>>>> index 96316111300a..19e1a5493778 100644 >>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>>>>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct >>>>>>>>>> amdgpu_device *adev, >>>>>>>>>> dev_info(adev->dev, "GPU reset end with ret = >>>>>>>>>> %d\n", r); >>>>>>>>>> atomic_set(&adev->reset_domain->reset_res, r); >>>>>>>>>> + >>>>>>>>>> + drm_dev_wedged_event(adev_to_drm(adev), >>>>>>>>>> DRM_WEDGE_RECOVERY_NONE); >>>>>>>>> >>>>>>>>> That looks really good in general. I would just make the >>>>>>>>> DRM_WEDGE_RECOVERY_NONE depend on the value of "r". >>>>>>>>> >>>>>>>> >>>>>>>> Why depend or `r`? A reset was triggered anyway, regardless of the >>>>>>>> success >>>>>>>> of it, shouldn't we tell userspace? >>>>>>> >>>>>>> A failed reset would perhaps result in wedging, atleast that's >>>>>>> how i915 >>>>>>> is handling it. >>>>>>> >>>>>> >>>>>> Right, and I think this raises the question of what wedge recovery >>>>>> method should I add for amdgpu... Christian? >>>>>> >>>>> >>>>> In theory a rebind should be enough to get the device going again, our >>>>> BOCO does a bus reset on driver load anyway. >>>>> >>>> >>>> The behavior varies between SOCs. In certain ones, if driver reset >>>> fails, that means it's really in a bad state and it would need system >>>> reboot. >>>> >>> >>> Is this documented somewhere? Then I could even add a >>> DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario. >> >> Not publicly as far as I know. But indeed a driver reset has basically >> the same chance of succeeding than a driver reload. >> >> I think the use case we have here is more that the administrator >> intentionally disabled the reset to allow HW investigation. >> >> So far we did that with a rather broken we don't do anything at all >> approach. > > OK. > >> >>>> I had asked earlier about the utility of this one here. If this is just >>>> to inform userspace that driver has done a reset and recovered, it >>>> would >>>> need some additional context also. We have a mechanism in KFD which >>>> sends the context in which a reset has to be done. Currently, that's >>>> restricted to compute applications, but if this is in a similar >>>> line, we >>>> would like to pass some additional info like job timeout, RAS error >>>> etc. >>>> >>> >>> DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done a >>> reset and recovered, but additional data about like which job >>> timeout, RAS error and such belong to devcoredump I guess, where all >>> data is gathered and collected later. >> >> I think somebody else mentioned it as well that the source of the >> issue, e.g. the PID of the submitting process would be helpful as well >> for supervising daemons which need to restart processes when they >> caused some issue. >> > > It was me :) we have a use case that we would need the PID for the > daemon indeed, but the daemon doesn't need to know what's the RAS error > or the job name that timeouted, there's no immediate action to be taken > with this information, contrary to the PID that we need to know. > Regarding devcoredump - it's not done every time. For ex: RAS errors have a different way to identify the source of error, hence we don't need a coredump in such cases. The intention is only to let the user know the reason for reset at a high level, and probably add more things later like the engines or queues that have reset etc. Thanks, Lijo >> We just postponed adding that till later. >> >> Regards, >> Christian. >> >>> >>>> Thanks, >>>> Lijo >>>> >>>>> Regards, >>>>> Christian. >>>> >>> >> >