[AMD Official Use Only] >-----Original Message----- >From: Quan, Evan <Evan.Quan@xxxxxxx> >Sent: Thursday, December 2, 2021 10:48 AM >To: Yu, Lang <Lang.Yu@xxxxxxx>; Koenig, Christian ><Christian.Koenig@xxxxxxx>; Christian König ><ckoenig.leichtzumerken@xxxxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Lazar, Lijo ><Lijo.Lazar@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx> >Subject: RE: [PATCH] drm/amdgpu: add support to SMU debug option > >[AMD Official Use Only] > > > >> -----Original Message----- >> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Yu, >> Lang >> Sent: Wednesday, December 1, 2021 7:37 PM >> To: Koenig, Christian <Christian.Koenig@xxxxxxx>; Christian König >> <ckoenig.leichtzumerken@xxxxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Lazar, Lijo >> <Lijo.Lazar@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx> >> Subject: RE: [PATCH] drm/amdgpu: add support to SMU debug option >> >> [AMD Official Use Only] >> >> >> >> >-----Original Message----- >> >From: Koenig, Christian <Christian.Koenig@xxxxxxx> >> >Sent: Wednesday, December 1, 2021 7:29 PM >> >To: Yu, Lang <Lang.Yu@xxxxxxx>; Christian König >> ><ckoenig.leichtzumerken@xxxxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >> >Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Lazar, Lijo >> ><Lijo.Lazar@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx> >> >Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option >> > >> >Am 01.12.21 um 12:20 schrieb Yu, Lang: >> >> [AMD Official Use Only] >> >> >> >>> -----Original Message----- >> >>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx> >> >>> Sent: Wednesday, December 1, 2021 6:49 PM >> >>> To: Yu, Lang <Lang.Yu@xxxxxxx>; Koenig, Christian >> >>> <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >> >>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Lazar, Lijo >> >>> <Lijo.Lazar@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx> >> >>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option >> >>> >> >>> Am 01.12.21 um 11:44 schrieb Yu, Lang: >> >>>> [AMD Official Use Only] >> >>>> >> >>>> >> >>>> >> >>>>> -----Original Message----- >> >>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx> >> >>>>> Sent: Wednesday, December 1, 2021 5:30 PM >> >>>>> To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >> >>>>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Lazar, Lijo >> >>>>> <Lijo.Lazar@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx> >> >>>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option >> >>>>> >> >>>>> Am 01.12.21 um 10:24 schrieb Lang Yu: >> >>>>>> To maintain system error state when SMU errors occurred, which >> >>>>>> will aid in debugging SMU firmware issues, add SMU debug option >> support. >> >>>>>> >> >>>>>> It can be enabled or disabled via amdgpu_smu_debug debugfs file. >> >>>>>> When enabled, it makes SMU errors fatal. >> >>>>>> It is disabled by default. >> >>>>>> >> >>>>>> == Command Guide == >> >>>>>> >> >>>>>> 1, enable SMU debug option >> >>>>>> >> >>>>>> # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug >> >>>>>> >> >>>>>> 2, disable SMU debug option >> >>>>>> >> >>>>>> # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug >> >>>>>> >> >>>>>> v3: >> >>>>>> - Use debugfs_create_bool().(Christian) >> >>>>>> - Put variable into smu_context struct. >> >>>>>> - Don't resend command when timeout. >> >>>>>> >> >>>>>> v2: >> >>>>>> - Resend command when timeout.(Lijo) >> >>>>>> - Use debugfs file instead of module parameter. >> >>>>>> >> >>>>>> Signed-off-by: Lang Yu <lang.yu@xxxxxxx> >> >>>>> Well the debugfs part looks really nice and clean now, but one >> >>>>> more comment below. >> >>>>> >> >>>>>> --- >> >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 3 +++ >> >>>>>> drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h | 5 +++++ >> >>>>>> drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++ >> >>>>>> drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 8 +++++++- >> >>>>>> 4 files changed, 17 insertions(+), 1 deletion(-) >> >>>>>> >> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >> >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >> >>>>>> index 164d6a9e9fbb..86cd888c7822 100644 >> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >> >>>>>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct >> >>>>>> amdgpu_device >> >>>>> *adev) >> >>>>>> if (!debugfs_initialized()) >> >>>>>> return 0; >> >>>>>> >> >>>>>> + debugfs_create_bool("amdgpu_smu_debug", 0600, root, >> >>>>>> + &adev->smu.smu_debug_mode); >> >>>>>> + >> >>>>>> ent = debugfs_create_file("amdgpu_preempt_ib", 0600, >> root, >> >adev, >> >>>>>> &fops_ib_preempt); >> >>>>>> if (IS_ERR(ent)) { >> >>>>>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h >> >>>>>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h >> >>>>>> index f738f7dc20c9..50dbf5594a9d 100644 >> >>>>>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h >> >>>>>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h >> >>>>>> @@ -569,6 +569,11 @@ struct smu_context >> >>>>>> struct smu_user_dpm_profile user_dpm_profile; >> >>>>>> >> >>>>>> struct stb_context stb_context; >> >>>>>> + /* >> >>>>>> + * When enabled, it makes SMU errors fatal. >> >>>>>> + * (0 = disabled (default), 1 = enabled) >> >>>>>> + */ >> >>>>>> + bool smu_debug_mode; >> >>>>>> }; >> >>>>>> >> >>>>>> struct i2c_adapter; >> >>>>>> diff --git >> a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c >> >>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c >> >>>>>> index 6e781cee8bb6..d3797a2d6451 100644 >> >>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c >> >>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c >> >>>>>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct >> >>>>> smu_context *smu) >> >>>>>> out: >> >>>>>> mutex_unlock(&smu->message_lock); >> >>>>>> >> >>>>>> + BUG_ON(unlikely(smu->smu_debug_mode) && ret); >> >>>>>> + >> >>>>>> return ret; >> >>>>>> } >> >>>>>> >> >>>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c >> >>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c >> >>>>>> index 048ca1673863..9be005eb4241 100644 >> >>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c >> >>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c >> >>>>>> @@ -349,15 +349,21 @@ int >> >smu_cmn_send_smc_msg_with_param(struct >> >>>>> smu_context *smu, >> >>>>>> __smu_cmn_reg_print_error(smu, reg, index, param, >> >msg); >> >>>>>> goto Out; >> >>>>>> } >> >>>>>> + >> >>>>>> __smu_cmn_send_msg(smu, (uint16_t) index, param); >> >>>>>> reg = __smu_cmn_poll_stat(smu); >> >>>>>> res = __smu_cmn_reg2errno(smu, reg); >> >>>>>> - if (res != 0) >> >>>>>> + if (res != 0) { >> >>>>>> __smu_cmn_reg_print_error(smu, reg, index, param, >> >msg); >> >>>>>> + goto Out; >> >>>>>> + } >> >>>>>> if (read_arg) >> >>>>>> smu_cmn_read_arg(smu, read_arg); >> >>>>>> Out: >> >>>>>> mutex_unlock(&smu->message_lock); >> >>>>>> + >> >>>>>> + BUG_ON(unlikely(smu->smu_debug_mode) && res); >> >>>>> BUG_ON() really crashes the kernel and is only allowed if we >> >>>>> prevent further data corruption with that. >> >>>>> >> >>>>> Most of the time WARN_ON() is more appropriate, but I can't >> >>>>> fully judge here since I don't know the SMU code well enough. >> >>>> This is what SMU FW guys want. They want "user-visible >> >>>> (potentially >> >>>> fatal) >> >>> errors", then a hang. >> >>>> They want to keep system state since the error occurred. >> >>> Well that is rather problematic. >> >>> >> >>> First of all we need to really justify that, crashing the kernel >> >>> is not something easily done. >> >>> >> >>> Then this isn't really effective here. What happens is that you >> >>> crash the kernel thread of the currently executing process, but it >> >>> is perfectly possible that another thread still tries to send >> >>> messages to the SMU. You need to have the BUG_ON() before dropping >> >>> the lock to make sure that this really gets the driver stuck in >> >>> the current >> state. >> >> Thanks. I got it. I just thought it is a kenel panic. >> >> Could we use a panic() here? >> > >> >Potentially, but that might reboot the system automatically which is >> >probably not what you want either. >> > >> >How does the SMU firmware team gather the necessary information when >> a >> >problem occurs? >> >> As far as I know, they usually use a HDT to collect information. >> And they request a hang when error occurred in ticket. >> "Suggested error responses include pop-up windows (by x86 driver, if >> this is >> possible) or simply hanging after logging the error. " >[Quan, Evan] Maybe what they want is just a stable SMU state(like no more >message issuing after failure). >If that's true, I think you can just bail out on __smu_cmn_poll_stat() failure(in >smu_cmn_send_smc_msg_with_param() and >smu_cmn_send_msg_without_waiting()). >That will prevent further message issuing to SMU. But it's difficult to distinguish normal timeout cases(see aldebaran_mode2_reset()). Sometimes it's reasonable to return a timeout. The user wants to use a longer timeout. Regards, Lang > >BR >Evan >> >> Regards, >> Lang >> >> > >> >When they connect external hardware a BUG_ON() while holding the SMU >> >lock might work, but if they want to SSH into the system you should >> >probably rather set a flag that the hardware shouldn't be touched any >> >more by the driver and continue. >> > >> >Otherwise the box is most likely really unstable after this. >> > >> >Regards, >> >Christian. >> > >> >> >> >> Regards, >> >> Lang >> >> >> >>> Regards, >> >>> Christian. >> >>> >> >>>> Regards, >> >>>> Lang >> >>>> >> >>>>> Christian. >> >>>>> >> >>>>>> + >> >>>>>> return res; >> >>>>>> } >> >>>>>>