[AMD Official Use Only] >-----Original Message----- >From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx> >Sent: Wednesday, December 1, 2021 6:49 PM >To: Yu, Lang <Lang.Yu@xxxxxxx>; Koenig, Christian ><Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Lazar, Lijo ><Lijo.Lazar@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx> >Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option > >Am 01.12.21 um 11:44 schrieb Yu, Lang: >> [AMD Official Use Only] >> >> >> >>> -----Original Message----- >>> From: Koenig, Christian <Christian.Koenig@xxxxxxx> >>> Sent: Wednesday, December 1, 2021 5:30 PM >>> To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Lazar, Lijo >>> <Lijo.Lazar@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx> >>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option >>> >>> Am 01.12.21 um 10:24 schrieb Lang Yu: >>>> To maintain system error state when SMU errors occurred, which will >>>> aid in debugging SMU firmware issues, add SMU debug option support. >>>> >>>> It can be enabled or disabled via amdgpu_smu_debug debugfs file. >>>> When enabled, it makes SMU errors fatal. >>>> It is disabled by default. >>>> >>>> == Command Guide == >>>> >>>> 1, enable SMU debug option >>>> >>>> # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug >>>> >>>> 2, disable SMU debug option >>>> >>>> # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug >>>> >>>> v3: >>>> - Use debugfs_create_bool().(Christian) >>>> - Put variable into smu_context struct. >>>> - Don't resend command when timeout. >>>> >>>> v2: >>>> - Resend command when timeout.(Lijo) >>>> - Use debugfs file instead of module parameter. >>>> >>>> Signed-off-by: Lang Yu <lang.yu@xxxxxxx> >>> Well the debugfs part looks really nice and clean now, but one more >>> comment below. >>> >>>> --- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 3 +++ >>>> drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h | 5 +++++ >>>> drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++ >>>> drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 8 +++++++- >>>> 4 files changed, 17 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >>>> index 164d6a9e9fbb..86cd888c7822 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >>>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device >>> *adev) >>>> if (!debugfs_initialized()) >>>> return 0; >>>> >>>> + debugfs_create_bool("amdgpu_smu_debug", 0600, root, >>>> + &adev->smu.smu_debug_mode); >>>> + >>>> ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev, >>>> &fops_ib_preempt); >>>> if (IS_ERR(ent)) { >>>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h >>>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h >>>> index f738f7dc20c9..50dbf5594a9d 100644 >>>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h >>>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h >>>> @@ -569,6 +569,11 @@ struct smu_context >>>> struct smu_user_dpm_profile user_dpm_profile; >>>> >>>> struct stb_context stb_context; >>>> + /* >>>> + * When enabled, it makes SMU errors fatal. >>>> + * (0 = disabled (default), 1 = enabled) >>>> + */ >>>> + bool smu_debug_mode; >>>> }; >>>> >>>> struct i2c_adapter; >>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c >>>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c >>>> index 6e781cee8bb6..d3797a2d6451 100644 >>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c >>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c >>>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct >>> smu_context *smu) >>>> out: >>>> mutex_unlock(&smu->message_lock); >>>> >>>> + BUG_ON(unlikely(smu->smu_debug_mode) && ret); >>>> + >>>> return ret; >>>> } >>>> >>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c >>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c >>>> index 048ca1673863..9be005eb4241 100644 >>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c >>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c >>>> @@ -349,15 +349,21 @@ int smu_cmn_send_smc_msg_with_param(struct >>> smu_context *smu, >>>> __smu_cmn_reg_print_error(smu, reg, index, param, msg); >>>> goto Out; >>>> } >>>> + >>>> __smu_cmn_send_msg(smu, (uint16_t) index, param); >>>> reg = __smu_cmn_poll_stat(smu); >>>> res = __smu_cmn_reg2errno(smu, reg); >>>> - if (res != 0) >>>> + if (res != 0) { >>>> __smu_cmn_reg_print_error(smu, reg, index, param, msg); >>>> + goto Out; >>>> + } >>>> if (read_arg) >>>> smu_cmn_read_arg(smu, read_arg); >>>> Out: >>>> mutex_unlock(&smu->message_lock); >>>> + >>>> + BUG_ON(unlikely(smu->smu_debug_mode) && res); >>> BUG_ON() really crashes the kernel and is only allowed if we prevent >>> further data corruption with that. >>> >>> Most of the time WARN_ON() is more appropriate, but I can't fully >>> judge here since I don't know the SMU code well enough. >> This is what SMU FW guys want. They want "user-visible (potentially fatal) >errors", then a hang. >> They want to keep system state since the error occurred. > >Well that is rather problematic. > >First of all we need to really justify that, crashing the kernel is not something >easily done. > >Then this isn't really effective here. What happens is that you crash the kernel >thread of the currently executing process, but it is perfectly possible that another >thread still tries to send messages to the SMU. You need to have the BUG_ON() >before dropping the lock to make sure that this really gets the driver stuck in the >current state. Thanks. I got it. I just thought it is a kenel panic. Could we use a panic() here? Regards, Lang >Regards, >Christian. > >> >> Regards, >> Lang >> >>> Christian. >>> >>>> + >>>> return res; >>>> } >>>>