On Fri, Aug 16, 2024 at 2:36 AM Huang, Trigger <Trigger.Huang@xxxxxxx> wrote:
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> > -----Original Message-----
> > From: Alex Deucher <alexdeucher@xxxxxxxxx>
> > Sent: Friday, August 16, 2024 12:02 AM
> > To: Huang, Trigger <Trigger.Huang@xxxxxxx>
> > Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Khatri, Sunil <Sunil.Khatri@xxxxxxx>;
> > Deucher, Alexander <Alexander.Deucher@xxxxxxx>
> > Subject: Re: [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter
> >
> > On Thu, Aug 15, 2024 at 7:39 AM <Trigger.Huang@xxxxxxx> wrote:
> > >
> > > From: Trigger Huang <Trigger.Huang@xxxxxxx>
> > >
> > > Add new separate parameter to control GPU coredump procedure. This can
> > > be used to decouple the coredump procedure from gpu recovery procedure
> > >
> > > Signed-off-by: Trigger Huang <Trigger.Huang@xxxxxxx>
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8 ++++++++
> > >  2 files changed, 9 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > index 937de21a7142..4dd465ad14af 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > @@ -201,6 +201,7 @@ extern uint amdgpu_force_long_training;
> > >  extern int amdgpu_lbpw;
> > >  extern int amdgpu_compute_multipipe;
> > >  extern int amdgpu_gpu_recovery;
> > > +extern int amdgpu_gpu_coredump;
> > >  extern int amdgpu_emu_mode;
> > >  extern uint amdgpu_smu_memory_pool_size;
> > >  extern int amdgpu_smu_pptable_id;
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > index b9529948f2b2..c5d357420236 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > @@ -178,6 +178,7 @@ uint amdgpu_force_long_training;
> > >  int amdgpu_lbpw = -1;
> > >  int amdgpu_compute_multipipe = -1;
> > >  int amdgpu_gpu_recovery = -1; /* auto */
> > > +int amdgpu_gpu_coredump;
> > >  int amdgpu_emu_mode;
> > >  uint amdgpu_smu_memory_pool_size;
> > >  int amdgpu_smu_pptable_id = -1;
> > > @@ -556,6 +557,13 @@ module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
> > >  MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = auto)");
> > >  module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
> > >
> > > +/**
> > > + * DOC: gpu_coredump (int)
> > > + * Set to enable GPU coredump mechanism (1 = enable, 0 = disable). The default is 0
> > > + */
> > > +MODULE_PARM_DESC(gpu_coredump, "Enable GPU coredump mechanism, (1 = enable, 0 = disable (default))");
> > > +module_param_named(gpu_coredump, amdgpu_gpu_coredump, int, 0444);
> >
> > I don't think we need a separate parameter for this, although if we do, this
> > would need to be enabled by default. If it needs to be decoupled from reset,
> > that's fine, but I don't see the need for a separate knob.
> >
> > Alex
>
> Hi Alex,
> It is fine to enable it by default
> There are several application scenarios that I can think of.
> 1, Customer may need to do the core dump with gpu_recovery disabled. This can be used for GPU hang debug
> 2, Customer may need to disable the core dump with gpu_recovery enabled. This can be used for quick GPU recovery, especially for some lightweight hangs that can be processed by soft recovery or per ring reset.
> 3, Customer may need to enable the core dump with gpu_recovery enabled.
> This can be used for GPU recovery but record the core dump for further check in stress test or system health check.
> It seems not easy to support all the scenarios by only using amdgpu_gpu_coredump, right?

We always want devcoredump enabled by default for full adapter resets,
otherwise it kind of defeats the purpose of the feature. Do we want
devcoredumps for cases where we can recover via soft recovery or per
queue reset?

If we mainly care about full adapter reset, then we can do something like:
1. In amdgpu_job_timedout(), call:
       if (!amdgpu_gpu_recovery)
               amdgpu_dev_coredump();
   between the per ring reset and the full adapter reset. That way it
   won't get called for soft recovery or per queue reset.
2. Leave the current dev_coredump code in place for the case when
   recovery is enabled.

If we want it for soft recovery, per queue reset, and full adapter
reset, then we can just call it unconditionally in amdgpu_job_timedout().
If the finer grained resets don't work, we'll get two dumps, but I think
that's probably ok. There is a rough sketch of where the call could sit
at the bottom of this mail.

Alex

> Regards,
> Trigger
>
> > > +
> > >  /**
> > >  * DOC: emu_mode (int)
> > >  * Set value 1 to enable emulation mode. This is only needed when running on an emulator. The default is 0 (disabled).
> > > --
> > > 2.34.1
> > >
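P.S. To make option 1 a bit more concrete, here is a rough, untested
sketch of where such a call could sit in amdgpu_job_timedout(). This is
only an illustration: the real function's locking, error handling and
reset_context setup are left out, and amdgpu_job_do_coredump() is just a
placeholder name for whatever devcoredump entry point we settle on, not
an existing helper.

/*
 * Illustrative sketch only -- not the actual upstream function.
 * Simplified control flow of amdgpu_job_timedout() in
 * drivers/gpu/drm/amd/amdgpu/amdgpu_job.c, showing where the dump
 * could be taken.  amdgpu_job_do_coredump() is a placeholder.
 */
static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
{
	struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
	struct amdgpu_job *job = to_amdgpu_job(s_job);
	struct amdgpu_device *adev = ring->adev;

	/* Step 1: try soft recovery first; no dump is taken if it succeeds. */
	if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent))
		goto exit;

	/* Step 2: per ring / per queue reset, when the IP supports it. */
	if (ring->funcs->reset) {
		/* ... existing per ring reset attempt, goto exit on success ... */
	}

	/*
	 * Step 3: we are past the lighter-weight resets, so capture the
	 * devcoredump here when full recovery is disabled.  When recovery
	 * is enabled, the existing dump taken in the adapter reset path
	 * still covers that case.
	 */
	if (!amdgpu_gpu_recovery)
		amdgpu_job_do_coredump(adev, job);	/* placeholder helper */

	if (amdgpu_device_should_recover_gpu(adev)) {
		/* ... existing reset_context setup + amdgpu_device_gpu_recover() ... */
	}

exit:
	return DRM_GPU_SCHED_STAT_NOMINAL;
}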