Re: [PATCH] drm/amdgpu: add ring reset messages

Alex Deucher <alexdeucher@xxxxxxxxx> · Mon, 28 Oct 2024 12:29:48 -0400

On Mon, Oct 28, 2024 at 11:41 AM Lazar, Lijo <lijo.lazar@xxxxxxx> wrote:
>
>
>
> On 10/28/2024 8:11 PM, Alex Deucher wrote:
> > Ping?
> >
> > On Fri, Oct 18, 2024 at 11:47 AM Alex Deucher <alexdeucher@xxxxxxxxx> wrote:
> >>
> >> Ping?
> >>
> >> On Tue, Oct 15, 2024 at 2:28 PM Alex Deucher <alexander.deucher@xxxxxxx> wrote:
> >>>
> >>> Add messages to make it clear when a per ring reset
> >>> happens.  This is helpful for debugging and aligns with
> >>> other reset methods.
> >>>
> >>> Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
> >>> ---
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 3 +++
> >>>  1 file changed, 3 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>> index 102742f1faa2..2d60552a13ac 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>> @@ -137,6 +137,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> >>>         /* attempt a per ring reset */
> >>>         if (amdgpu_gpu_recovery &&
> >>>             ring->funcs->reset) {
> >>> +               dev_err(adev->dev, "Starting %s ring reset\n", s_job->sched->name);
>
> Is dev_err intentional or dev_info is good enough? Also, suggest to add
> ring name to fail/pass messages.

I was being consistent with the other messages from this function.
They are all dev_err.  Will add the ring name.
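
For illustration, here is a minimal user-space sketch of what the three messages could look like once the ring name is included in the pass/fail cases. It is not the revised patch: the enum, the helper name format_ring_reset_msg(), and the exact wording of the success and failure strings are assumptions made for this example. snprintf() stands in for dev_err(), which only exists in the kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stages of a per ring reset; not a kernel type. */
enum ring_reset_stage {
	RING_RESET_START,
	RING_RESET_SUCCESS,
	RING_RESET_FAILURE,
};

/*
 * User-space stand-in for the dev_err() calls in amdgpu_job_timedout().
 * Formats the message for the given stage into buf, always including
 * the ring name (s_job->sched->name in the patch). Returns the
 * snprintf() result, or -1 for an unknown stage.
 */
static int format_ring_reset_msg(char *buf, size_t len,
				 enum ring_reset_stage stage,
				 const char *ring_name)
{
	assert(buf != NULL && ring_name != NULL);

	switch (stage) {
	case RING_RESET_START:
		/* Wording taken from the original patch. */
		return snprintf(buf, len, "Starting %s ring reset\n", ring_name);
	case RING_RESET_SUCCESS:
		/* Assumed wording with the ring name added. */
		return snprintf(buf, len, "Ring %s reset success\n", ring_name);
	case RING_RESET_FAILURE:
		/* Assumed wording with the ring name added. */
		return snprintf(buf, len, "Ring %s reset failure\n", ring_name);
	}
	return -1;
}
```

In the driver itself, each case would presumably remain a direct dev_err(adev->dev, ...) call at the same three points in the function, so the log level stays consistent with the other messages there.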

Thanks,

Alex

>
> Thanks,
> Lijo
>
> >>>                 /* stop the scheduler, but don't mess with the
> >>>                  * bad job yet because if ring reset fails
> >>>                  * we'll fall back to full GPU reset.
> >>> @@ -150,8 +151,10 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> >>>                         amdgpu_fence_driver_force_completion(ring);
> >>>                         if (amdgpu_ring_sched_ready(ring))
> >>>                                 drm_sched_start(&ring->sched);
> >>> +                       dev_err(adev->dev, "Ring reset success\n");
> >>>                         goto exit;
> >>>                 }
> >>> +               dev_err(adev->dev, "Ring reset failure\n");
> >>>         }
> >>>
> >>>         if (amdgpu_device_should_recover_gpu(ring->adev)) {
> >>> --
> >>> 2.46.2
> >>>