Re: [PATCH 3/3] drm/amdgpu: Change the timing of doing coredump

Alex Deucher <alexdeucher@xxxxxxxxx> · Thu, 15 Aug 2024 12:09:27 -0400

On Thu, Aug 15, 2024 at 7:50 AM <Trigger.Huang@xxxxxxx> wrote:
>
> From: Trigger Huang <Trigger.Huang@xxxxxxx>
>
> Do the coredump immediately after a job timeout to get a closer
> representation of GPU's error status. For other code paths that
> need to do the coredump, keep the original logic unchanged, except:
> 1,All the coredump operations will be under the control of parameter
> amdgpu_gpu_coredump
> 2, Do the ip dump and register to dev codedump system together.
>
> Signed-off-by: Trigger Huang <Trigger.Huang@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++-------------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 10 ++++++++++
>  2 files changed, 13 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 69d6a5b7ca34..a8eb76d82921 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5341,15 +5341,9 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>                         }
>                 }
>
> -               if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)) {
> -                       dev_info(tmp_adev->dev, "Dumping IP State\n");
> -                       /* Trigger ip dump before we reset the asic */
> -                       for (i = 0; i < tmp_adev->num_ip_blocks; i++)
> -                               if (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
> -                                       tmp_adev->ip_blocks[i].version->funcs
> -                                               ->dump_ip_state((void *)tmp_adev);
> -                       dev_info(tmp_adev->dev, "Dumping IP State Completed\n");
> -               }
> +               if (amdgpu_gpu_coredump &&
> +                   (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)))
> +                       amdgpu_device_gpu_coredump_single(tmp_adev, job);
>
>                 if (need_full_reset)
>                         r = amdgpu_device_ip_suspend(adev);
> @@ -5444,10 +5438,6 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>                                         goto out;
>
>                                 vram_lost = amdgpu_device_check_vram_lost(tmp_adev);
> -
> -                               if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags))
> -                                       amdgpu_coredump(tmp_adev, vram_lost, reset_context);

This needs to stay here, otherwise, we won't know the status of vram_lost.

> -
>                                 if (vram_lost) {
>                                         DRM_INFO("VRAM is lost due to GPU reset!\n");
>                                         amdgpu_inc_vram_lost(tmp_adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index c6a1783fc9ef..63869820c334 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -48,6 +48,12 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>                 return DRM_GPU_SCHED_STAT_ENODEV;
>         }
>
> +       /*
> +        * Do the coredump immediately after a job timeout to get a closer
> +        * representation of GPU's error status.
> +        */
> +       if (amdgpu_gpu_coredump)
> +               amdgpu_device_gpu_coredump(adev, job);

The problem with doing this here is that we miss core dumps for GPU
resets that happen for reasons besides a user job timeout.  E.g.,
resets from KFD or a hang due to bad programming in a kernel context.

If you want to keep this here, I'd suggest something like:

if (!amdgpu_gpu_recovery)
     amdgu_core_dump();

That way you'll get a dump in both cases.  Maybe add a flag to skip
printing vram_lost in this case since it happens before reset so it's
irrelevant.

Alex

>
>         adev->job_hang = true;
>
> @@ -101,6 +107,10 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>                 reset_context.src = AMDGPU_RESET_SRC_JOB;
>                 clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
>
> +               /* To avoid a double coredump, as we have already done it */
> +               if (amdgpu_gpu_coredump)
> +                       set_bit(AMDGPU_SKIP_COREDUMP, &reset_context.flags);
> +
>                 r = amdgpu_device_gpu_recover(ring->adev, job, &reset_context);
>                 if (r)
>                         dev_err(adev->dev, "GPU Recovery Failed: %d\n", r);
> --
> 2.34.1
>