[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Alex Deucher <alexdeucher@xxxxxxxxx>
> Sent: Friday, August 16, 2024 9:43 PM
> To: Huang, Trigger <Trigger.Huang@xxxxxxx>
> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Khatri, Sunil <Sunil.Khatri@xxxxxxx>;
> Deucher, Alexander <Alexander.Deucher@xxxxxxx>
> Subject: Re: [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter
>
> On Fri, Aug 16, 2024 at 2:36 AM Huang, Trigger <Trigger.Huang@xxxxxxx> wrote:
> >
> > > -----Original Message-----
> > > From: Alex Deucher <alexdeucher@xxxxxxxxx>
> > > Sent: Friday, August 16, 2024 12:02 AM
> > > To: Huang, Trigger <Trigger.Huang@xxxxxxx>
> > > Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Khatri, Sunil <Sunil.Khatri@xxxxxxx>;
> > > Deucher, Alexander <Alexander.Deucher@xxxxxxx>
> > > Subject: Re: [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter
> > >
> > > On Thu, Aug 15, 2024 at 7:39 AM <Trigger.Huang@xxxxxxx> wrote:
> > > >
> > > > From: Trigger Huang <Trigger.Huang@xxxxxxx>
> > > >
> > > > Add a new separate parameter to control the GPU coredump procedure.
This > > > > can be used to decouple the coredump procedure from gpu recovery > > > > procedure > > > > > > > > Signed-off-by: Trigger Huang <Trigger.Huang@xxxxxxx> > > > > --- > > > > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8 ++++++++ > > > > 2 files changed, 9 insertions(+) > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > > > index 937de21a7142..4dd465ad14af 100644 > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > > > @@ -201,6 +201,7 @@ extern uint amdgpu_force_long_training; > > > > extern int amdgpu_lbpw; extern int amdgpu_compute_multipipe; > > > > extern int amdgpu_gpu_recovery; > > > > +extern int amdgpu_gpu_coredump; > > > > extern int amdgpu_emu_mode; > > > > extern uint amdgpu_smu_memory_pool_size; extern int > > > > amdgpu_smu_pptable_id; diff --git > > > > a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > > > index b9529948f2b2..c5d357420236 100644 > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > > > @@ -178,6 +178,7 @@ uint amdgpu_force_long_training; int > > > amdgpu_lbpw > > > > = -1; int amdgpu_compute_multipipe = -1; int amdgpu_gpu_recovery > > > > = -1; /* auto */ > > > > +int amdgpu_gpu_coredump; > > > > int amdgpu_emu_mode; > > > > uint amdgpu_smu_memory_pool_size; int amdgpu_smu_pptable_id = > > > > -1; @@ -556,6 +557,13 @@ > module_param_named(compute_multipipe, > > > > amdgpu_compute_multipipe, int, 0444); > > > MODULE_PARM_DESC(gpu_recovery, > > > > "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = > > > > auto)"); module_param_named(gpu_recovery, amdgpu_gpu_recovery, > > > > int, 0444); > > > > > > > > +/** > > > > + * DOC: gpu_coredump (int) > > > > + * Set to enable GPU coredump mechanism (1 = enable, 0 = disable). 
> > > > + * The default is 0.
> > > > + */
> > > > +MODULE_PARM_DESC(gpu_coredump, "Enable GPU coredump mechanism, (1 = enable, 0 = disable (default))");
> > > > +module_param_named(gpu_coredump, amdgpu_gpu_coredump, int, 0444);
> > >
> > > I don't think we need a separate parameter for this, although if we do,
> > > this would need to be enabled by default. If it needs to be decoupled
> > > from reset, that's fine, but I don't see the need for a separate knob.
> > >
> > > Alex
> >
> > Hi Alex,
> > It is fine to enable it by default.
> > There are several application scenarios that I can think of:
> > 1. A customer may need to take the coredump with gpu_recovery disabled.
> >    This can be used for GPU hang debugging.
> > 2. A customer may need to disable the coredump with gpu_recovery enabled.
> >    This can be used for quick GPU recovery, especially for lightweight
> >    hangs that can be handled by soft recovery or a per-ring reset.
> > 3. A customer may need to enable the coredump with gpu_recovery enabled.
> >    This can be used to recover the GPU while still recording the coredump
> >    for later inspection in stress tests or system health checks.
> > It seems it would not be easy to support all of these scenarios using
> > only amdgpu_gpu_coredump, right?
>
> We always want devcoredump enabled by default for full adapter resets,
> otherwise it kind of defeats the purpose of the feature. Do we want
> devcoredumps for cases where we can recover via soft recovery or per-queue
> reset? If we mainly care about full adapter resets, then we can do
> something like:
>
> 1. In amdgpu_job_timedout(), we can do:
>        if (!amdgpu_gpu_recovery)
>                amdgpu_dev_coredump();
>    between the per-ring reset and the full adapter reset. That way it
>    won't get called for soft recovery or a per-queue reset.
>
> 2. Leave the current dev_coredump code in place for the case when
>    recovery is enabled.
>
> If we want it for soft recovery, queue reset, and full adapter reset
> alike, then we can just unconditionally call it in amdgpu_job_timedout().
> If the finer-grained resets don't work, we'll get two dumps, but I think
> that's probably OK.

OK, that sounds reasonable. Let me drop the new parameter in the new patch.
I think we want it for all of them (soft recovery + queue reset + full
adapter reset) in the job timeout scenario. My only concern is that we need
to do the coredump immediately after a job timeout, to get a closer
representation of the GPU's error status. For other cases/scenarios, I will
leave the current logic unchanged in the new patch.

Thanks,
Trigger

> Alex
>
> >
> > Regards,
> > Trigger
> >
> > > > +
> > > >  /**
> > > >   * DOC: emu_mode (int)
> > > >   * Set value 1 to enable emulation mode. This is only needed when
> > > >   * running on an emulator. The default is 0 (disabled).
> > > > --
> > > > 2.34.1