RE: [PATCH 2/2] drm/amdgpu: Do core dump immediately when job tmo

"Huang, Trigger" <Trigger.Huang@xxxxxxx> · Tue, 20 Aug 2024 07:30:44 +0000

[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Khatri, Sunil <Sunil.Khatri@xxxxxxx>
> Sent: Monday, August 19, 2024 6:31 PM
> To: Huang, Trigger <Trigger.Huang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>
> Subject: Re: [PATCH 2/2] drm/amdgpu: Do core dump immediately when job
> tmo
>
>
> On 8/19/2024 3:23 PM, Trigger.Huang@xxxxxxx wrote:
> > From: Trigger Huang <Trigger.Huang@xxxxxxx>
> >
> > Do the coredump immediately after a job timeout to get a closer
> > representation of GPU's error status.
> >
> > V2: This will skip printing vram_lost as the GPU reset is not happened
> > yet (Alex)
> >
> > V3: Unconditionally call the core dump as we care about all the reset
> > functions(soft-recovery and queue reset and full adapter reset, Alex)
> >
> > Signed-off-by: Trigger Huang <Trigger.Huang@xxxxxxx>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 62
> +++++++++++++++++++++++++
> >   1 file changed, 62 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > index c6a1783fc9ef..ebbb1434073e 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > @@ -30,6 +30,61 @@
> >   #include "amdgpu.h"
> >   #include "amdgpu_trace.h"
> >   #include "amdgpu_reset.h"
> > +#include "amdgpu_dev_coredump.h"
> > +#include "amdgpu_xgmi.h"
> > +
> > +static void amdgpu_job_do_core_dump(struct amdgpu_device *adev,
> > +                               struct amdgpu_job *job)
> > +{
> > +   int i;
> > +
> > +   dev_info(adev->dev, "Dumping IP State\n");
> > +   for (i = 0; i < adev->num_ip_blocks; i++) {
> > +           if (adev->ip_blocks[i].version->funcs->dump_ip_state)
> > +                   adev->ip_blocks[i].version->funcs
> > +                           ->dump_ip_state((void *)adev);
> > +           dev_info(adev->dev, "Dumping IP State Completed\n");
> > +   }
> > +
> > +   amdgpu_coredump(adev, true, false, job); }
> > +
> > +static void amdgpu_job_core_dump(struct amdgpu_device *adev,
> > +                            struct amdgpu_job *job)
> > +{
> > +   struct list_head device_list, *device_list_handle =  NULL;
> > +   struct amdgpu_device *tmp_adev = NULL;
> > +   struct amdgpu_hive_info *hive = NULL;
> > +
> > +   if (!amdgpu_sriov_vf(adev))
> > +           hive = amdgpu_get_xgmi_hive(adev);
> > +   if (hive)
> > +           mutex_lock(&hive->hive_lock);
> > +   /*
> > +    * Reuse the logic in amdgpu_device_gpu_recover() to build list of
> > +    * devices for code dump
> > +    */
> > +   INIT_LIST_HEAD(&device_list);
> > +   if (!amdgpu_sriov_vf(adev) && (adev-
> >gmc.xgmi.num_physical_nodes > 1) && hive) {
> > +           list_for_each_entry(tmp_adev, &hive->device_list,
> gmc.xgmi.head)
> > +                   list_add_tail(&tmp_adev->reset_list, &device_list);
> > +           if (!list_is_first(&adev->reset_list, &device_list))
> > +                   list_rotate_to_front(&adev->reset_list, &device_list);
> > +           device_list_handle = &device_list;
> > +   } else {
> > +           list_add_tail(&adev->reset_list, &device_list);
> > +           device_list_handle = &device_list;
> > +   }
> > +
> > +   /* Do the coredump for each device */
> > +   list_for_each_entry(tmp_adev, device_list_handle, reset_list)
> > +           amdgpu_job_do_core_dump(tmp_adev, job);
> > +
> > +   if (hive) {
> > +           mutex_unlock(&hive->hive_lock);
> > +           amdgpu_put_xgmi_hive(hive);
> > +   }
> > +}
> >
> >   static enum drm_gpu_sched_stat amdgpu_job_timedout(struct
> drm_sched_job *s_job)
> >   {
> > @@ -48,6 +103,7 @@ static enum drm_gpu_sched_stat
> amdgpu_job_timedout(struct drm_sched_job *s_job)
> >             return DRM_GPU_SCHED_STAT_ENODEV;
> >     }
> >
> > +   amdgpu_job_core_dump(adev, job);
> The philosophy is hang and recovery is to let the HW and software try to
> recover. Here we try to do a soft recovery first and i think we should wait for
> seft recovery and if fails then we do dump and thats exactly we are doing here.

Hi Sunil ,
thanks for the suggestion, and that's reasonable. But my concern is that after soft recovery happened, the GPU's status may change(take gfx 9 for example, it will try to kill the current hang wave)
 Actually, in most cases, a real shader hang cannot be resolved through soft recovery, and at that moment, we need to get a very close dump/snapshot/representation of GPU's current error status.
Just like the scandump, when we trying to do a scandump for a shader hang, we will disable gpu_recovery, and no soft recovery/per-queue reset/HW reset will happen before the scandump, right?
On most products, there are no scandump interfaces, so core dump is even more important for debugging GPU hang issue.

Regards,
Trigger

>
> Also we need to make sure that the tasks which are already in queue are put
> on hold and the their sync points are signalled before we dump.
> check once what all steps are taken before we dump in the current
> implementation.

Do you mean sometimes like:
        drm_sched_wqueue_stop(&ring->sched);
        amdgpu_fence_driver_force_completion(ring); // Since there is no GPU reset happened, is it reasonable to call it here?
        amdgpu_job_core_dump(adev, job);

Regards,
Trigger

>
> Regards
>
> Sunil khatri
>
> >
> >     adev->job_hang = true;
> >
> > @@ -101,6 +157,12 @@ static enum drm_gpu_sched_stat
> amdgpu_job_timedout(struct drm_sched_job *s_job)
> >             reset_context.src = AMDGPU_RESET_SRC_JOB;
> >             clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
> >
> > +           /*
> > +            * To avoid an unnecessary extra coredump, as we have
> already
> > +            * got the very close representation of GPU's error status
> > +            */
> > +           set_bit(AMDGPU_SKIP_COREDUMP, &reset_context.flags);
> > +
> >             r = amdgpu_device_gpu_recover(ring->adev, job,
> &reset_context);
> >             if (r)
> >                     dev_err(adev->dev, "GPU Recovery Failed: %d\n", r);