[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Alex Deucher <alexdeucher@xxxxxxxxx>
> Sent: Friday, August 16, 2024 9:43 PM
> To: Huang, Trigger <Trigger.Huang@xxxxxxx>
> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Khatri, Sunil <Sunil.Khatri@xxxxxxx>;
> Deucher, Alexander <Alexander.Deucher@xxxxxxx>
> Subject: Re: [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter
>
> On Fri, Aug 16, 2024 at 2:36 AM Huang, Trigger <Trigger.Huang@xxxxxxx> wrote:
> >
> > > -----Original Message-----
> > > From: Alex Deucher <alexdeucher@xxxxxxxxx>
> > > Sent: Friday, August 16, 2024 12:02 AM
> > > To: Huang, Trigger <Trigger.Huang@xxxxxxx>
> > > Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Khatri, Sunil <Sunil.Khatri@xxxxxxx>;
> > > Deucher, Alexander <Alexander.Deucher@xxxxxxx>
> > > Subject: Re: [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter
> > >
> > > On Thu, Aug 15, 2024 at 7:39 AM <Trigger.Huang@xxxxxxx> wrote:
> > > >
> > > > From: Trigger Huang <Trigger.Huang@xxxxxxx>
> > > >
> > > > Add a new separate parameter to control the GPU coredump procedure.
This > > > > can be used to decouple the coredump procedure from gpu recovery > > > > procedure > > > > > > > > Signed-off-by: Trigger Huang <Trigger.Huang@xxxxxxx> > > > > --- > > > > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8 ++++++++ > > > > 2 files changed, 9 insertions(+) > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > > > index 937de21a7142..4dd465ad14af 100644 > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > > > @@ -201,6 +201,7 @@ extern uint amdgpu_force_long_training; > > > > extern int amdgpu_lbpw; extern int amdgpu_compute_multipipe; > > > > extern int amdgpu_gpu_recovery; > > > > +extern int amdgpu_gpu_coredump; > > > > extern int amdgpu_emu_mode; > > > > extern uint amdgpu_smu_memory_pool_size; extern int > > > > amdgpu_smu_pptable_id; diff --git > > > > a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > > > index b9529948f2b2..c5d357420236 100644 > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > > > @@ -178,6 +178,7 @@ uint amdgpu_force_long_training; int > > > amdgpu_lbpw > > > > = -1; int amdgpu_compute_multipipe = -1; int amdgpu_gpu_recovery > > > > = -1; /* auto */ > > > > +int amdgpu_gpu_coredump; > > > > int amdgpu_emu_mode; > > > > uint amdgpu_smu_memory_pool_size; int amdgpu_smu_pptable_id = > > > > -1; @@ -556,6 +557,13 @@ > module_param_named(compute_multipipe, > > > > amdgpu_compute_multipipe, int, 0444); > > > MODULE_PARM_DESC(gpu_recovery, > > > > "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = > > > > auto)"); module_param_named(gpu_recovery, amdgpu_gpu_recovery, > > > > int, 0444); > > > > > > > > +/** > > > > + * DOC: gpu_coredump (int) > > > > + * Set to enable GPU coredump mechanism (1 = enable, 0 = disable). 
> > > > + * The default is 0.
> > > > + */
> > > > +MODULE_PARM_DESC(gpu_coredump, "Enable GPU coredump mechanism, (1 = enable, 0 = disable (default))");
> > > > +module_param_named(gpu_coredump, amdgpu_gpu_coredump, int, 0444);
> > >
> > > I don't think we need a separate parameter for this, although if we do,
> > > this would need to be enabled by default. If it needs to be decoupled
> > > from reset, that's fine, but I don't see the need for a separate knob.
> > >
> > > Alex
> >
> > Hi Alex,
> > It is fine to enable it by default.
> > There are several application scenarios that I can think of:
> > 1. A customer may need to take the coredump with gpu_recovery disabled.
> >    This can be used for GPU hang debugging.
> > 2. A customer may need to disable the coredump with gpu_recovery enabled.
> >    This can be used for quick GPU recovery, especially for lightweight
> >    hangs that can be handled by soft recovery or a per-ring reset.
> > 3. A customer may need to enable the coredump with gpu_recovery enabled.
> >    This can be used to recover the GPU while still recording the coredump
> >    for later inspection in stress tests or system health checks.
> > It seems it would not be easy to support all of these scenarios using
> > only amdgpu_gpu_coredump, right?
>
> We always want devcoredump enabled by default for full adapter resets,
> otherwise it kind of defeats the purpose of the feature. Do we want
> devcoredumps for cases where we can recover via soft recovery or per-queue
> reset? If we mainly care about full adapter resets, then we can do
> something like:
>
> 1. In amdgpu_job_timedout(), we can do:
>        if (!amdgpu_gpu_recovery)
>                amdgpu_dev_coredump();
>    between the per-ring reset and the full adapter reset. That way it
>    won't get called for soft recovery or a per-queue reset.
>
> 2. Leave the current dev_coredump code in place for the case when
>    recovery is enabled.
>
> If we want it for soft recovery, queue reset, and full adapter reset
> alike, then we can just unconditionally call it in amdgpu_job_timedout().
> If the finer-grained resets don't work, we'll get two dumps, but I think
> that's probably OK.

OK, that sounds reasonable. Let me drop the new parameter in the new patch.
I think we want it for all of them (soft recovery + queue reset + full
adapter reset) in the job timeout scenario. My only concern is that we need
to do the coredump immediately after a job timeout, to get a closer
representation of the GPU's error status. For other cases/scenarios, I will
leave the current logic unchanged in the new patch.

Thanks,
Trigger

> Alex
>
> >
> > Regards,
> > Trigger
> >
> > > > +
> > > >  /**
> > > >   * DOC: emu_mode (int)
> > > >   * Set value 1 to enable emulation mode. This is only needed when
> > > >   * running on an emulator. The default is 0 (disabled).
> > > > --
> > > > 2.34.1