RE: [PATCH 4/4] drm/amdgpu: Do core dump immediately when job tmo

[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Alex Deucher <alexdeucher@xxxxxxxxx>
> Sent: Friday, August 16, 2024 9:59 PM
> To: Huang, Trigger <Trigger.Huang@xxxxxxx>
> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Khatri, Sunil <Sunil.Khatri@xxxxxxx>;
> Deucher, Alexander <Alexander.Deucher@xxxxxxx>
> Subject: Re: [PATCH 4/4] drm/amdgpu: Do core dump immediately when job
> tmo
>
> On Fri, Aug 16, 2024 at 4:05 AM <Trigger.Huang@xxxxxxx> wrote:
> >
> > From: Trigger Huang <Trigger.Huang@xxxxxxx>
> >
> > Do the coredump immediately after a job timeout to get a closer
> > representation of the GPU's error status.
> >
> > V2: Skip printing vram_lost, as the GPU reset has not happened
> > yet (Alex)
> >
> > Signed-off-by: Trigger Huang <Trigger.Huang@xxxxxxx>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 64 +++++++++++++++++++++++++
> >  1 file changed, 64 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > index c6a1783fc9ef..009551335d05 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > @@ -28,9 +28,63 @@
> >  #include <drm/drm_drv.h>
> >
> >  #include "amdgpu.h"
> > +#include "amdgpu_dev_coredump.h"
> > +#include "amdgpu_xgmi.h"
> >  #include "amdgpu_trace.h"
> >  #include "amdgpu_reset.h"
> >
> > +static void amdgpu_job_do_core_dump(struct amdgpu_device *adev,
> > +                                   struct amdgpu_job *job)
> > +{
> > +       int i = 0;
> > +
> > +       dev_info(adev->dev, "Dumping IP State\n");
> > +       for (i = 0; i < adev->num_ip_blocks; i++) {
> > +               if (adev->ip_blocks[i].version->funcs->dump_ip_state)
> > +                       adev->ip_blocks[i].version->funcs
> > +                               ->dump_ip_state((void *)adev);
> > +               dev_info(adev->dev, "Dumping IP State Completed\n");
> > +       }
> > +
> > +       amdgpu_coredump(adev, true, false, job);
> > +}
> > +
> > +static void amdgpu_job_core_dump(struct amdgpu_device *adev,
> > +                                struct amdgpu_job *job)
> > +{
> > +       struct list_head device_list, *device_list_handle =  NULL;
> > +       struct amdgpu_device *tmp_adev = NULL;
> > +       struct amdgpu_hive_info *hive = NULL;
> > +
> > +       if (!amdgpu_sriov_vf(adev))
> > +               hive = amdgpu_get_xgmi_hive(adev);
> > +       if (hive)
> > +               mutex_lock(&hive->hive_lock);
> > +       /*
> > +        * Reuse the logic in amdgpu_device_gpu_recover() to build the list of
> > +        * devices for the core dump
> > +        */
> > +       INIT_LIST_HEAD(&device_list);
> > +       if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1) && hive) {
> > +               list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head)
> > +                       list_add_tail(&tmp_adev->reset_list, &device_list);
> > +               if (!list_is_first(&adev->reset_list, &device_list))
> > +                       list_rotate_to_front(&adev->reset_list, &device_list);
> > +               device_list_handle = &device_list;
> > +       } else {
> > +               list_add_tail(&adev->reset_list, &device_list);
> > +               device_list_handle = &device_list;
> > +       }
> > +
> > +       /* Do the coredump for each device */
> > +       list_for_each_entry(tmp_adev, device_list_handle, reset_list)
> > +               amdgpu_job_do_core_dump(tmp_adev, job);
> > +
> > +       if (hive) {
> > +               mutex_unlock(&hive->hive_lock);
> > +               amdgpu_put_xgmi_hive(hive);
> > +       }
> > +}
> > +
> >  static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> >  {
> >         struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
> > @@ -48,6 +102,12 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> >                 return DRM_GPU_SCHED_STAT_ENODEV;
> >         }
> >
> > +       /*
> > +        * Do the coredump immediately after a job timeout to get a closer
> > +        * representation of the GPU's error status.
> > +        */
> > +       if (amdgpu_gpu_coredump)
> > +               amdgpu_job_core_dump(adev, job);
> >
> >         adev->job_hang = true;
> >
> > @@ -101,6 +161,10 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> >                 reset_context.src = AMDGPU_RESET_SRC_JOB;
> >                 clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
> >
> > +               /* Avoid a double coredump, since we have already done it above */
> > +               if (amdgpu_gpu_coredump)
> > +                       set_bit(AMDGPU_SKIP_COREDUMP, &reset_context.flags);
> > +
>
> I could go either way on this hunk.  Do you see any problems with a double
> core dump?  We are effectively doing two resets...

I prefer to do it only once, since the dump taken at timeout already gives a very close representation of the GPU's error status.
What's more, dumping the IP state a second time would cost several milliseconds and require extra MMIO accesses to read back all the registers.
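
Just to spell out the intent of setting AMDGPU_SKIP_COREDUMP above: the recovery path is expected to test that bit before dumping again, roughly along the lines of the sketch below. This is illustration only (the function name example_reset_coredump is made up, and the amdgpu_coredump() arguments simply mirror the skip-vram-check/vram-lost/job call used in the patch); it is not the actual amdgpu_device.c code:

#include "amdgpu.h"
#include "amdgpu_dev_coredump.h"
#include "amdgpu_reset.h"

/*
 * Sketch only: the job-timeout handler has already captured the core
 * dump and set AMDGPU_SKIP_COREDUMP in the reset context, so the
 * recovery path tests the flag and skips its own dump.
 */
static void example_reset_coredump(struct amdgpu_device *adev,
                                   struct amdgpu_reset_context *reset_context,
                                   bool vram_lost)
{
        /* Dump only if the job-timeout handler has not done it already */
        if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags))
                amdgpu_coredump(adev, false, vram_lost, reset_context->job);
}

With the bit set from the timeout handler, we end up with exactly one snapshot per hang, taken as close to the fault as possible.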

Regards,
Trigger
>
> Alex
>
> >                 r = amdgpu_device_gpu_recover(ring->adev, job, &reset_context);
> >                 if (r)
> >                         dev_err(adev->dev, "GPU Recovery Failed: %d\n", r);
> > --
> > 2.34.1
> >



