[AMD Official Use Only] > -----Original Message----- > From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Yu, > Lang > Sent: Wednesday, December 1, 2021 7:37 PM > To: Koenig, Christian <Christian.Koenig@xxxxxxx>; Christian König > <ckoenig.leichtzumerken@xxxxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Lazar, Lijo > <Lijo.Lazar@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx> > Subject: RE: [PATCH] drm/amdgpu: add support to SMU debug option > > [AMD Official Use Only] > > > > >-----Original Message----- > >From: Koenig, Christian <Christian.Koenig@xxxxxxx> > >Sent: Wednesday, December 1, 2021 7:29 PM > >To: Yu, Lang <Lang.Yu@xxxxxxx>; Christian König > ><ckoenig.leichtzumerken@xxxxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > >Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Lazar, Lijo > ><Lijo.Lazar@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx> > >Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option > > > >Am 01.12.21 um 12:20 schrieb Yu, Lang: > >> [AMD Official Use Only] > >> > >>> -----Original Message----- > >>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx> > >>> Sent: Wednesday, December 1, 2021 6:49 PM > >>> To: Yu, Lang <Lang.Yu@xxxxxxx>; Koenig, Christian > >>> <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > >>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Lazar, Lijo > >>> <Lijo.Lazar@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx> > >>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option > >>> > >>> Am 01.12.21 um 11:44 schrieb Yu, Lang: > >>>> [AMD Official Use Only] > >>>> > >>>> > >>>> > >>>>> -----Original Message----- > >>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx> > >>>>> Sent: Wednesday, December 1, 2021 5:30 PM > >>>>> To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > >>>>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Lazar, Lijo > >>>>> <Lijo.Lazar@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx> > >>>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option > >>>>> > >>>>> Am 01.12.21 um 10:24 schrieb Lang Yu: > >>>>>> To maintain system error state when SMU errors occurred, which > >>>>>> will aid in debugging SMU firmware issues, add SMU debug option > support. > >>>>>> > >>>>>> It can be enabled or disabled via amdgpu_smu_debug debugfs file. > >>>>>> When enabled, it makes SMU errors fatal. > >>>>>> It is disabled by default. > >>>>>> > >>>>>> == Command Guide == > >>>>>> > >>>>>> 1, enable SMU debug option > >>>>>> > >>>>>> # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug > >>>>>> > >>>>>> 2, disable SMU debug option > >>>>>> > >>>>>> # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug > >>>>>> > >>>>>> v3: > >>>>>> - Use debugfs_create_bool().(Christian) > >>>>>> - Put variable into smu_context struct. > >>>>>> - Don't resend command when timeout. > >>>>>> > >>>>>> v2: > >>>>>> - Resend command when timeout.(Lijo) > >>>>>> - Use debugfs file instead of module parameter. > >>>>>> > >>>>>> Signed-off-by: Lang Yu <lang.yu@xxxxxxx> > >>>>> Well the debugfs part looks really nice and clean now, but one > >>>>> more comment below. > >>>>> > >>>>>> --- > >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 3 +++ > >>>>>> drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h | 5 +++++ > >>>>>> drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++ > >>>>>> drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 8 +++++++- > >>>>>> 4 files changed, 17 insertions(+), 1 deletion(-) > >>>>>> > >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > >>>>>> index 164d6a9e9fbb..86cd888c7822 100644 > >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > >>>>>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct > >>>>>> amdgpu_device > >>>>> *adev) > >>>>>> if (!debugfs_initialized()) > >>>>>> return 0; > >>>>>> > >>>>>> + debugfs_create_bool("amdgpu_smu_debug", 0600, root, > >>>>>> + &adev->smu.smu_debug_mode); > >>>>>> + > >>>>>> ent = debugfs_create_file("amdgpu_preempt_ib", 0600, > root, > >adev, > >>>>>> &fops_ib_preempt); > >>>>>> if (IS_ERR(ent)) { > >>>>>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h > >>>>>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h > >>>>>> index f738f7dc20c9..50dbf5594a9d 100644 > >>>>>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h > >>>>>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h > >>>>>> @@ -569,6 +569,11 @@ struct smu_context > >>>>>> struct smu_user_dpm_profile user_dpm_profile; > >>>>>> > >>>>>> struct stb_context stb_context; > >>>>>> + /* > >>>>>> + * When enabled, it makes SMU errors fatal. > >>>>>> + * (0 = disabled (default), 1 = enabled) > >>>>>> + */ > >>>>>> + bool smu_debug_mode; > >>>>>> }; > >>>>>> > >>>>>> struct i2c_adapter; > >>>>>> diff --git > a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c > >>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c > >>>>>> index 6e781cee8bb6..d3797a2d6451 100644 > >>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c > >>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c > >>>>>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct > >>>>> smu_context *smu) > >>>>>> out: > >>>>>> mutex_unlock(&smu->message_lock); > >>>>>> > >>>>>> + BUG_ON(unlikely(smu->smu_debug_mode) && ret); > >>>>>> + > >>>>>> return ret; > >>>>>> } > >>>>>> > >>>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c > >>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c > >>>>>> index 048ca1673863..9be005eb4241 100644 > >>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c > >>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c > >>>>>> @@ -349,15 +349,21 @@ int > >smu_cmn_send_smc_msg_with_param(struct > >>>>> smu_context *smu, > >>>>>> __smu_cmn_reg_print_error(smu, reg, index, param, > >msg); > >>>>>> goto Out; > >>>>>> } > >>>>>> + > >>>>>> __smu_cmn_send_msg(smu, (uint16_t) index, param); > >>>>>> reg = __smu_cmn_poll_stat(smu); > >>>>>> res = __smu_cmn_reg2errno(smu, reg); > >>>>>> - if (res != 0) > >>>>>> + if (res != 0) { > >>>>>> __smu_cmn_reg_print_error(smu, reg, index, param, > >msg); > >>>>>> + goto Out; > >>>>>> + } > >>>>>> if (read_arg) > >>>>>> smu_cmn_read_arg(smu, read_arg); > >>>>>> Out: > >>>>>> mutex_unlock(&smu->message_lock); > >>>>>> + > >>>>>> + BUG_ON(unlikely(smu->smu_debug_mode) && res); > >>>>> BUG_ON() really crashes the kernel and is only allowed if we > >>>>> prevent further data corruption with that. > >>>>> > >>>>> Most of the time WARN_ON() is more appropriate, but I can't fully > >>>>> judge here since I don't know the SMU code well enough. > >>>> This is what SMU FW guys want. They want "user-visible (potentially > >>>> fatal) > >>> errors", then a hang. > >>>> They want to keep system state since the error occurred. > >>> Well that is rather problematic. > >>> > >>> First of all we need to really justify that, crashing the kernel is > >>> not something easily done. > >>> > >>> Then this isn't really effective here. What happens is that you > >>> crash the kernel thread of the currently executing process, but it > >>> is perfectly possible that another thread still tries to send > >>> messages to the SMU. You need to have the BUG_ON() before dropping > >>> the lock to make sure that this really gets the driver stuck in the current > state. > >> Thanks. I got it. I just thought it is a kenel panic. > >> Could we use a panic() here? > > > >Potentially, but that might reboot the system automatically which is > >probably not what you want either. > > > >How does the SMU firmware team gather the necessary information when > a > >problem occurs? > > As far as I know, they usually use a HDT to collect information. > And they request a hang when error occurred in ticket. > "Suggested error responses include pop-up windows (by x86 driver, if this is > possible) or simply hanging after logging the error. " [Quan, Evan] Maybe what they want is just a stable SMU state(like no more message issuing after failure). If that's true, I think you can just bail out on __smu_cmn_poll_stat() failure(in smu_cmn_send_smc_msg_with_param() and smu_cmn_send_msg_without_waiting()). That will prevent further message issuing to SMU. BR Evan > > Regards, > Lang > > > > >When they connect external hardware a BUG_ON() while holding the SMU > >lock might work, but if they want to SSH into the system you should > >probably rather set a flag that the hardware shouldn't be touched any > >more by the driver and continue. > > > >Otherwise the box is most likely really unstable after this. > > > >Regards, > >Christian. > > > >> > >> Regards, > >> Lang > >> > >>> Regards, > >>> Christian. > >>> > >>>> Regards, > >>>> Lang > >>>> > >>>>> Christian. > >>>>> > >>>>>> + > >>>>>> return res; > >>>>>> } > >>>>>>