RE: [PATCH 3/3] drm/amdgpu: Change the timing of doing coredump

"Huang, Trigger" <Trigger.Huang@xxxxxxx> · Fri, 16 Aug 2024 06:37:06 +0000

[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Alex Deucher <alexdeucher@xxxxxxxxx>
> Sent: Friday, August 16, 2024 12:09 AM
> To: Huang, Trigger <Trigger.Huang@xxxxxxx>
> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Khatri, Sunil <Sunil.Khatri@xxxxxxx>;
> Deucher, Alexander <Alexander.Deucher@xxxxxxx>
> Subject: Re: [PATCH 3/3] drm/amdgpu: Change the timing of doing coredump
>
> On Thu, Aug 15, 2024 at 7:50 AM <Trigger.Huang@xxxxxxx> wrote:
> >
> > From: Trigger Huang <Trigger.Huang@xxxxxxx>
> >
> > Do the coredump immediately after a job timeout to get a closer
> > representation of GPU's error status. For other code paths that need
> > to do the coredump, keep the original logic unchanged, except:
> > 1,All the coredump operations will be under the control of parameter
> > amdgpu_gpu_coredump 2, Do the ip dump and register to dev codedump
> > system together.
> >
> > Signed-off-by: Trigger Huang <Trigger.Huang@xxxxxxx>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++-------------
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 10 ++++++++++
> >  2 files changed, 13 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 69d6a5b7ca34..a8eb76d82921 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -5341,15 +5341,9 @@ int amdgpu_device_pre_asic_reset(struct
> amdgpu_device *adev,
> >                         }
> >                 }
> >
> > -               if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)) {
> > -                       dev_info(tmp_adev->dev, "Dumping IP State\n");
> > -                       /* Trigger ip dump before we reset the asic */
> > -                       for (i = 0; i < tmp_adev->num_ip_blocks; i++)
> > -                               if (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
> > -                                       tmp_adev->ip_blocks[i].version->funcs
> > -                                               ->dump_ip_state((void *)tmp_adev);
> > -                       dev_info(tmp_adev->dev, "Dumping IP State Completed\n");
> > -               }
> > +               if (amdgpu_gpu_coredump &&
> > +                   (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)))
> > +                       amdgpu_device_gpu_coredump_single(tmp_adev,
> > + job);
> >
> >                 if (need_full_reset)
> >                         r = amdgpu_device_ip_suspend(adev); @@
> > -5444,10 +5438,6 @@ int amdgpu_do_asic_reset(struct list_head
> *device_list_handle,
> >                                         goto out;
> >
> >                                 vram_lost =
> > amdgpu_device_check_vram_lost(tmp_adev);
> > -
> > -                               if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context-
> >flags))
> > -                                       amdgpu_coredump(tmp_adev, vram_lost,
> reset_context);
>
> This needs to stay here, otherwise, we won't know the status of vram_lost.
>

Ok, understood, seems vram_lost is the only dependence between core dump and GPU rest. Will not change the logic here.

> > -
> >                                 if (vram_lost) {
> >                                         DRM_INFO("VRAM is lost due to GPU reset!\n");
> >
> > amdgpu_inc_vram_lost(tmp_adev); diff --git
> > a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > index c6a1783fc9ef..63869820c334 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > @@ -48,6 +48,12 @@ static enum drm_gpu_sched_stat
> amdgpu_job_timedout(struct drm_sched_job *s_job)
> >                 return DRM_GPU_SCHED_STAT_ENODEV;
> >         }
> >
> > +       /*
> > +        * Do the coredump immediately after a job timeout to get a closer
> > +        * representation of GPU's error status.
> > +        */
> > +       if (amdgpu_gpu_coredump)
> > +               amdgpu_device_gpu_coredump(adev, job);
>
> The problem with doing this here is that we miss core dumps for GPU resets
> that happen for reasons besides a user job timeout.  E.g., resets from KFD or a
> hang due to bad programming in a kernel context.

Ok, I will not change the logic in the code path of other reasons, but I want to add a control of the master switch, amdgpu_gpu_coredump, when doing dump in other code paths.

>
> If you want to keep this here, I'd suggest something like:
>
> if (!amdgpu_gpu_recovery)
>      amdgu_core_dump();
>
> That way you'll get a dump in both cases.  Maybe add a flag to skip printing
> vram_lost in this case since it happens before reset so it's irrelevant.
>
> Alex

Good idea, let me add a flag to skip printing vram_lost in job timeout scenario, but still printing vram_lost in other code patch, like KFD hang.
And as described before, there are several application scenarios, it seems only using amdgpu_gpu_recovery can not support all the scenarios

Regards,
Trigger
>
> >
> >         adev->job_hang = true;
> >
> > @@ -101,6 +107,10 @@ static enum drm_gpu_sched_stat
> amdgpu_job_timedout(struct drm_sched_job *s_job)
> >                 reset_context.src = AMDGPU_RESET_SRC_JOB;
> >                 clear_bit(AMDGPU_NEED_FULL_RESET,
> > &reset_context.flags);
> >
> > +               /* To avoid a double coredump, as we have already done it */
> > +               if (amdgpu_gpu_coredump)
> > +                       set_bit(AMDGPU_SKIP_COREDUMP,
> > + &reset_context.flags);
> > +
> >                 r = amdgpu_device_gpu_recover(ring->adev, job, &reset_context);
> >                 if (r)
> >                         dev_err(adev->dev, "GPU Recovery Failed:
> > %d\n", r);
> > --
> > 2.34.1
> >