[AMD Official Use Only] >-----Original Message----- >From: Koenig, Christian <Christian.Koenig@xxxxxxx> >Sent: Wednesday, December 1, 2021 7:29 PM >To: Yu, Lang <Lang.Yu@xxxxxxx>; Christian König ><ckoenig.leichtzumerken@xxxxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Lazar, Lijo ><Lijo.Lazar@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx> >Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option > >Am 01.12.21 um 12:20 schrieb Yu, Lang: >> [AMD Official Use Only] >> >>> -----Original Message----- >>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx> >>> Sent: Wednesday, December 1, 2021 6:49 PM >>> To: Yu, Lang <Lang.Yu@xxxxxxx>; Koenig, Christian >>> <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Lazar, Lijo >>> <Lijo.Lazar@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx> >>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option >>> >>> Am 01.12.21 um 11:44 schrieb Yu, Lang: >>>> [AMD Official Use Only] >>>> >>>> >>>> >>>>> -----Original Message----- >>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx> >>>>> Sent: Wednesday, December 1, 2021 5:30 PM >>>>> To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>>>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Lazar, Lijo >>>>> <Lijo.Lazar@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx> >>>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option >>>>> >>>>> Am 01.12.21 um 10:24 schrieb Lang Yu: >>>>>> To maintain system error state when SMU errors occurred, which >>>>>> will aid in debugging SMU firmware issues, add SMU debug option support. >>>>>> >>>>>> It can be enabled or disabled via amdgpu_smu_debug debugfs file. >>>>>> When enabled, it makes SMU errors fatal. >>>>>> It is disabled by default. >>>>>> >>>>>> == Command Guide == >>>>>> >>>>>> 1, enable SMU debug option >>>>>> >>>>>> # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug >>>>>> >>>>>> 2, disable SMU debug option >>>>>> >>>>>> # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug >>>>>> >>>>>> v3: >>>>>> - Use debugfs_create_bool().(Christian) >>>>>> - Put variable into smu_context struct. >>>>>> - Don't resend command when timeout. >>>>>> >>>>>> v2: >>>>>> - Resend command when timeout.(Lijo) >>>>>> - Use debugfs file instead of module parameter. >>>>>> >>>>>> Signed-off-by: Lang Yu <lang.yu@xxxxxxx> >>>>> Well the debugfs part looks really nice and clean now, but one more >>>>> comment below. >>>>> >>>>>> --- >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 3 +++ >>>>>> drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h | 5 +++++ >>>>>> drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++ >>>>>> drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 8 +++++++- >>>>>> 4 files changed, 17 insertions(+), 1 deletion(-) >>>>>> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >>>>>> index 164d6a9e9fbb..86cd888c7822 100644 >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >>>>>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device >>>>> *adev) >>>>>> if (!debugfs_initialized()) >>>>>> return 0; >>>>>> >>>>>> + debugfs_create_bool("amdgpu_smu_debug", 0600, root, >>>>>> + &adev->smu.smu_debug_mode); >>>>>> + >>>>>> ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, >adev, >>>>>> &fops_ib_preempt); >>>>>> if (IS_ERR(ent)) { >>>>>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h >>>>>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h >>>>>> index f738f7dc20c9..50dbf5594a9d 100644 >>>>>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h >>>>>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h >>>>>> @@ -569,6 +569,11 @@ struct smu_context >>>>>> struct smu_user_dpm_profile user_dpm_profile; >>>>>> >>>>>> struct stb_context stb_context; >>>>>> + /* >>>>>> + * When enabled, it makes SMU errors fatal. >>>>>> + * (0 = disabled (default), 1 = enabled) >>>>>> + */ >>>>>> + bool smu_debug_mode; >>>>>> }; >>>>>> >>>>>> struct i2c_adapter; >>>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c >>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c >>>>>> index 6e781cee8bb6..d3797a2d6451 100644 >>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c >>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c >>>>>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct >>>>> smu_context *smu) >>>>>> out: >>>>>> mutex_unlock(&smu->message_lock); >>>>>> >>>>>> + BUG_ON(unlikely(smu->smu_debug_mode) && ret); >>>>>> + >>>>>> return ret; >>>>>> } >>>>>> >>>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c >>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c >>>>>> index 048ca1673863..9be005eb4241 100644 >>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c >>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c >>>>>> @@ -349,15 +349,21 @@ int >smu_cmn_send_smc_msg_with_param(struct >>>>> smu_context *smu, >>>>>> __smu_cmn_reg_print_error(smu, reg, index, param, >msg); >>>>>> goto Out; >>>>>> } >>>>>> + >>>>>> __smu_cmn_send_msg(smu, (uint16_t) index, param); >>>>>> reg = __smu_cmn_poll_stat(smu); >>>>>> res = __smu_cmn_reg2errno(smu, reg); >>>>>> - if (res != 0) >>>>>> + if (res != 0) { >>>>>> __smu_cmn_reg_print_error(smu, reg, index, param, >msg); >>>>>> + goto Out; >>>>>> + } >>>>>> if (read_arg) >>>>>> smu_cmn_read_arg(smu, read_arg); >>>>>> Out: >>>>>> mutex_unlock(&smu->message_lock); >>>>>> + >>>>>> + BUG_ON(unlikely(smu->smu_debug_mode) && res); >>>>> BUG_ON() really crashes the kernel and is only allowed if we >>>>> prevent further data corruption with that. >>>>> >>>>> Most of the time WARN_ON() is more appropriate, but I can't fully >>>>> judge here since I don't know the SMU code well enough. >>>> This is what SMU FW guys want. They want "user-visible (potentially >>>> fatal) >>> errors", then a hang. >>>> They want to keep system state since the error occurred. >>> Well that is rather problematic. >>> >>> First of all we need to really justify that, crashing the kernel is >>> not something easily done. >>> >>> Then this isn't really effective here. What happens is that you crash >>> the kernel thread of the currently executing process, but it is >>> perfectly possible that another thread still tries to send messages >>> to the SMU. You need to have the BUG_ON() before dropping the lock to >>> make sure that this really gets the driver stuck in the current state. >> Thanks. I got it. I just thought it is a kenel panic. >> Could we use a panic() here? > >Potentially, but that might reboot the system automatically which is probably not >what you want either. > >How does the SMU firmware team gather the necessary information when a >problem occurs? As far as I know, they usually use a HDT to collect information. And they request a hang when error occurred in ticket. "Suggested error responses include pop-up windows (by x86 driver, if this is possible) or simply hanging after logging the error. " Regards, Lang > >When they connect external hardware a BUG_ON() while holding the SMU lock >might work, but if they want to SSH into the system you should probably rather >set a flag that the hardware shouldn't be touched any more by the driver and >continue. > >Otherwise the box is most likely really unstable after this. > >Regards, >Christian. > >> >> Regards, >> Lang >> >>> Regards, >>> Christian. >>> >>>> Regards, >>>> Lang >>>> >>>>> Christian. >>>>> >>>>>> + >>>>>> return res; >>>>>> } >>>>>>