RE: [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Alex Deucher <alexdeucher@xxxxxxxxx>
> Sent: Friday, August 16, 2024 12:02 AM
> To: Huang, Trigger <Trigger.Huang@xxxxxxx>
> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Khatri, Sunil <Sunil.Khatri@xxxxxxx>;
> Deucher, Alexander <Alexander.Deucher@xxxxxxx>
> Subject: Re: [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter
>
> On Thu, Aug 15, 2024 at 7:39 AM <Trigger.Huang@xxxxxxx> wrote:
> >
> > From: Trigger Huang <Trigger.Huang@xxxxxxx>
> >
> > Add new separate parameter to control GPU coredump procedure. This can
> > be used to decouple the coredump procedure from gpu recovery procedure
> >
> > Signed-off-by: Trigger Huang <Trigger.Huang@xxxxxxx>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8 ++++++++
> >  2 files changed, 9 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index 937de21a7142..4dd465ad14af 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -201,6 +201,7 @@ extern uint amdgpu_force_long_training;  extern
> > int amdgpu_lbpw;  extern int amdgpu_compute_multipipe;  extern int
> > amdgpu_gpu_recovery;
> > +extern int amdgpu_gpu_coredump;
> >  extern int amdgpu_emu_mode;
> >  extern uint amdgpu_smu_memory_pool_size;  extern int
> > amdgpu_smu_pptable_id; diff --git
> > a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index b9529948f2b2..c5d357420236 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -178,6 +178,7 @@ uint amdgpu_force_long_training;  int
> amdgpu_lbpw
> > = -1;  int amdgpu_compute_multipipe = -1;  int amdgpu_gpu_recovery =
> > -1; /* auto */
> > +int amdgpu_gpu_coredump;
> >  int amdgpu_emu_mode;
> >  uint amdgpu_smu_memory_pool_size;
> >  int amdgpu_smu_pptable_id = -1;
> > @@ -556,6 +557,13 @@ module_param_named(compute_multipipe,
> > amdgpu_compute_multipipe, int, 0444);
> MODULE_PARM_DESC(gpu_recovery,
> > "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 =
> > auto)");  module_param_named(gpu_recovery, amdgpu_gpu_recovery, int,
> > 0444);
> >
> > +/**
> > + * DOC: gpu_coredump (int)
> > + * Set to enable GPU coredump mechanism (1 = enable, 0 = disable).
> > +The default is 0  */ MODULE_PARM_DESC(gpu_coredump, "Enable GPU
> > +coredump mechanism, (1 = enable, 0 = disable (default))");
> > +module_param_named(gpu_coredump, amdgpu_gpu_coredump, int,
> 0444);
>
> I don't think we need a separate parameter for this, although if we do, this
> would need to be enabled by default.  If it needs to be decoupled from reset,
> that's fine, but I don't see the need for a separate knob.
>
> Alex

Hi Alex,
It is fine to enable it by default
There are several application scenarios that I can think of.
        1, Customer may need to do the core dump with gpu_recovery disabled. This can be used for GPU hang debug
        2, Customer may need to disable the core dump with gpu_recovery enabled. This can be used for quick GPU recovery, especially for some lightweight hangs that can be processed by soft recovery or per ring reset.
        3, Customer may need to enable the core dump with gpu_recovery enabled. This can be used for GPU recovery but record the core dump for further check in stress test or system health check.
It seems not easy to support all the scenarios by only using amdgpu_gpu_coredump, right?

Regards,
Trigger
>
> > +
> >  /**
> >   * DOC: emu_mode (int)
> >   * Set value 1 to enable emulation mode. This is only needed when running
> on an emulator. The default is 0 (disabled).
> > --
> > 2.34.1
> >




[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux