[AMD Official Use Only - AMD Internal Distribution Only] > -----Original Message----- > From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of > Huang, Trigger > Sent: Friday, August 16, 2024 2:36 PM > To: Alex Deucher <alexdeucher@xxxxxxxxx> > Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Khatri, Sunil <Sunil.Khatri@xxxxxxx>; > Deucher, Alexander <Alexander.Deucher@xxxxxxx> > Subject: RE: [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter > > [AMD Official Use Only - AMD Internal Distribution Only] > > [AMD Official Use Only - AMD Internal Distribution Only] > > > -----Original Message----- > > From: Alex Deucher <alexdeucher@xxxxxxxxx> > > Sent: Friday, August 16, 2024 12:02 AM > > To: Huang, Trigger <Trigger.Huang@xxxxxxx> > > Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Khatri, Sunil > > <Sunil.Khatri@xxxxxxx>; Deucher, Alexander > <Alexander.Deucher@xxxxxxx> > > Subject: Re: [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter > > > > On Thu, Aug 15, 2024 at 7:39 AM <Trigger.Huang@xxxxxxx> wrote: > > > > > > From: Trigger Huang <Trigger.Huang@xxxxxxx> > > > > > > Add new separate parameter to control GPU coredump procedure. This > > > can be used to decouple the coredump procedure from gpu recovery > > > procedure > > > > > > Signed-off-by: Trigger Huang <Trigger.Huang@xxxxxxx> > > > --- > > > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + > > > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8 ++++++++ > > > 2 files changed, 9 insertions(+) > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > > index 937de21a7142..4dd465ad14af 100644 > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > > @@ -201,6 +201,7 @@ extern uint amdgpu_force_long_training; extern > > > int amdgpu_lbpw; extern int amdgpu_compute_multipipe; extern int > > > amdgpu_gpu_recovery; > > > +extern int amdgpu_gpu_coredump; > > > extern int amdgpu_emu_mode; > > > extern uint amdgpu_smu_memory_pool_size; extern int > > > amdgpu_smu_pptable_id; diff --git > > > a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > > index b9529948f2b2..c5d357420236 100644 > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > > @@ -178,6 +178,7 @@ uint amdgpu_force_long_training; int > > amdgpu_lbpw > > > = -1; int amdgpu_compute_multipipe = -1; int amdgpu_gpu_recovery = > > > -1; /* auto */ > > > +int amdgpu_gpu_coredump; > > > int amdgpu_emu_mode; > > > uint amdgpu_smu_memory_pool_size; > > > int amdgpu_smu_pptable_id = -1; > > > @@ -556,6 +557,13 @@ module_param_named(compute_multipipe, > > > amdgpu_compute_multipipe, int, 0444); > > MODULE_PARM_DESC(gpu_recovery, > > > "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = > > > auto)"); module_param_named(gpu_recovery, amdgpu_gpu_recovery, > int, > > > 0444); > > > > > > +/** > > > + * DOC: gpu_coredump (int) > > > + * Set to enable GPU coredump mechanism (1 = enable, 0 = disable). > > > +The default is 0 */ MODULE_PARM_DESC(gpu_coredump, "Enable GPU > > > +coredump mechanism, (1 = enable, 0 = disable (default))"); > > > +module_param_named(gpu_coredump, amdgpu_gpu_coredump, int, > > 0444); > > > > I don't think we need a separate parameter for this, although if we > > do, this would need to be enabled by default. If it needs to be > > decoupled from reset, that's fine, but I don't see the need for a separate > knob. > > > > Alex > > Hi Alex, > It is fine to enable it by default > There are several application scenarios that I can think of. > 1, Customer may need to do the core dump with gpu_recovery disabled. > This can be used for GPU hang debug > 2, Customer may need to disable the core dump with gpu_recovery > enabled. This can be used for quick GPU recovery, especially for some > lightweight hangs that can be processed by soft recovery or per ring reset. > 3, Customer may need to enable the core dump with gpu_recovery > enabled. This can be used for GPU recovery but record the core dump for > further check in stress test or system health check. > It seems not easy to support all the scenarios by only using > amdgpu_gpu_coredump, right? Sorry, this is a typo, here I mean amdgpu_gpu_recovery, while not amdgpu_gpu_coredump. > > Regards, > Trigger > > > > > + > > > /** > > > * DOC: emu_mode (int) > > > * Set value 1 to enable emulation mode. This is only needed when > > > running > > on an emulator. The default is 0 (disabled). > > > -- > > > 2.34.1 > > >