On 12/10/ , Christian KKKnig wrote: > Am 10.12.21 um 04:21 schrieb Lang Yu: > > On 12/10/ , Quan, Evan wrote: > > > [AMD Official Use Only] > > > > > > > > > > > > > -----Original Message----- > > > > From: Yu, Lang <Lang.Yu@xxxxxxx> > > > > Sent: Friday, December 10, 2021 10:34 AM > > > > To: Quan, Evan <Evan.Quan@xxxxxxx> > > > > Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Grodzovsky, Andrey > > > > <Andrey.Grodzovsky@xxxxxxx>; Lazar, Lijo <Lijo.Lazar@xxxxxxx>; Huang, > > > > Ray <Ray.Huang@xxxxxxx>; Deucher, Alexander > > > > <Alexander.Deucher@xxxxxxx>; Koenig, Christian > > > > <Christian.Koenig@xxxxxxx> > > > > Subject: Re: [PATCH 2/2] drm/amdgpu: add support for SMU debug option > > > > > > > > On 12/10/ , Quan, Evan wrote: > > > > > [AMD Official Use Only] > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of > > > > > > Lang Yu > > > > > > Sent: Thursday, December 9, 2021 4:49 PM > > > > > > To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx > > > > > > Cc: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>; Lazar, Lijo > > > > > > <Lijo.Lazar@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx>; Deucher, > > > > > > Alexander <Alexander.Deucher@xxxxxxx>; Yu, Lang > > > > <Lang.Yu@xxxxxxx>; > > > > > > Koenig, Christian <Christian.Koenig@xxxxxxx> > > > > > > Subject: [PATCH 2/2] drm/amdgpu: add support for SMU debug option > > > > > > > > > > > > SMU firmware guys expect the driver maintains error context and > > > > > > doesn't interact with SMU any more when SMU errors occurred. > > > > > > That will aid in debugging SMU firmware issues. > > > > > > > > > > > > Add SMU debug option support for this request, it can be enabled or > > > > > > disabled via amdgpu_smu_debug debugfs file. > > > > > > When enabled, it brings hardware to a kind of halt state so that no > > > > > > one can touch it any more in the envent of SMU errors. > > > > > > > > > > > > Currently, dirver interacts with SMU via sending messages. > > > > > > And threre are three ways to sending messages to SMU. > > > > > > Handle them respectively as following: > > > > > > > > > > > > 1, smu_cmn_send_smc_msg_with_param() for normal timeout cases > > > > > > > > > > > > Halt on any error. > > > > > > > > > > > > 2, > > > > smu_cmn_send_msg_without_waiting()/smu_cmn_wait_for_response() > > > > > > for longer timeout cases > > > > > > > > > > > > Halt on errors apart from ETIME. Otherwise this way won't work. > > > > > > > > > > > > 3, smu_cmn_send_msg_without_waiting() for no waiting cases > > > > > > > > > > > > Halt on errors apart from ETIME. Otherwise second way won't work. > > > > > > > > > > > > After halting, use BUG() to explicitly notify users. > > > > > > > > > > > > == Command Guide == > > > > > > > > > > > > 1, enable SMU debug option > > > > > > > > > > > > # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug > > > > > > > > > > > > 2, disable SMU debug option > > > > > > > > > > > > # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug > > > > > > > > > > > > v4: > > > > > > - Set to halt state instead of a simple hang.(Christian) > > > > > > > > > > > > v3: > > > > > > - Use debugfs_create_bool().(Christian) > > > > > > - Put variable into smu_context struct. > > > > > > - Don't resend command when timeout. > > > > > > > > > > > > v2: > > > > > > - Resend command when timeout.(Lijo) > > > > > > - Use debugfs file instead of module parameter. > > > > > > > > > > > > Signed-off-by: Lang Yu <lang.yu@xxxxxxx> > > > > > > --- > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 3 +++ > > > > > > drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h | 5 +++++ > > > > > > drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 20 > > > > > > +++++++++++++++++++- > > > > > > 3 files changed, 27 insertions(+), 1 deletion(-) > > > > > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > > > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > > > > > > index 164d6a9e9fbb..86cd888c7822 100644 > > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > > > > > > @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct > > > > amdgpu_device > > > > > > *adev) > > > > > > if (!debugfs_initialized()) > > > > > > return 0; > > > > > > > > > > > > + debugfs_create_bool("amdgpu_smu_debug", 0600, root, > > > > > > + &adev->smu.smu_debug_mode); > > > > > > + > > > > > > ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev, > > > > > > &fops_ib_preempt); > > > > > > if (IS_ERR(ent)) { > > > > > > diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h > > > > > > b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h > > > > > > index f738f7dc20c9..50dbf5594a9d 100644 > > > > > > --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h > > > > > > +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h > > > > > > @@ -569,6 +569,11 @@ struct smu_context > > > > > > struct smu_user_dpm_profile user_dpm_profile; > > > > > > > > > > > > struct stb_context stb_context; > > > > > > + /* > > > > > > + * When enabled, it makes SMU errors fatal. > > > > > > + * (0 = disabled (default), 1 = enabled) > > > > > > + */ > > > > > > + bool smu_debug_mode; > > > > > [Quan, Evan] Can you expand this to bit mask(as ppfeaturemask)? So that > > > > in future we can add support for other debug features. > > > > > > }; > > > > OK. > > > > > > > > > > struct i2c_adapter; > > > > > > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c > > > > > > b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c > > > > > > index 048ca1673863..84016d22c075 100644 > > > > > > --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c > > > > > > +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c > > > > > > @@ -272,6 +272,11 @@ int smu_cmn_send_msg_without_waiting(struct > > > > > > smu_context *smu, > > > > > > __smu_cmn_send_msg(smu, msg_index, param); > > > > > > res = 0; > > > > > > Out: > > > > > > + if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) { > > > > > > + amdgpu_device_halt(smu->adev); > > > > > > + BUG(); > > > > > [Quan, Evan] I agree amdgpu_device_halt() is a good idea. Christian and > > > > Andrey can share you more insights about that. > > > > > Do we still need the "BUG()" then? > > > > The BUG() is used to explicitly notify users something went wrong. > > > > Otherwise userspace may not know immediately. > > > > FW guys request this in ticket. > > > [Quan, Evan] Won't drm_dev_unplug() and pci_disable_device() used in amdgpu_device_halt throw some errors(on user's further attempt to communicate with our driver)? > > > Also if the purpose is to raise user's concern, WARN() may be a more gentle way? > > From my testing and observation, it depends on what the driver will do next. > > Probably trigger a page fault. If we don't connect a monitor but SSH access. > > We don't know what happended until something like a page fault is triggered. > > > > But here what I want to do is throwing the error immediately to userspace. > > If using WARN(), the user need to poll dmesg to see if something went wrong. > > I agree with Evan as well. Please use WARN() or WARN_ON() here instead. > > BUG() and BUG_ON() is only allowed when you prevent further data corruption > by intentionally crashing the kernel thread. But that is really invasive > because for example locks are not necessarily dropped and so can crash the > whole system. Ok, will use WARN(). Regards, Lang > Regards, > Christian. > > > > > Regards, > > Lang > > > > > BR > > > Evan > > > > Regards, > > > > Lang > > > > > > > > > BR > > > > > Evan > > > > > > + } > > > > > > + > > > > > > return res; > > > > > > } > > > > > > > > > > > > @@ -288,9 +293,17 @@ int smu_cmn_send_msg_without_waiting(struct > > > > > > smu_context *smu, > > > > > > int smu_cmn_wait_for_response(struct smu_context *smu) { > > > > > > u32 reg; > > > > > > + int res; > > > > > > > > > > > > reg = __smu_cmn_poll_stat(smu); > > > > > > - return __smu_cmn_reg2errno(smu, reg); > > > > > > + res = __smu_cmn_reg2errno(smu, reg); > > > > > > + > > > > > > + if (unlikely(smu->smu_debug_mode) && res && (res != -ETIME)) { > > > > > > + amdgpu_device_halt(smu->adev); > > > > > > + BUG(); > > > > > > + } > > > > > > + > > > > > > + return res; > > > > > > } > > > > > > > > > > > > /** > > > > > > @@ -357,6 +370,11 @@ int > > > > smu_cmn_send_smc_msg_with_param(struct > > > > > > smu_context *smu, > > > > > > if (read_arg) > > > > > > smu_cmn_read_arg(smu, read_arg); > > > > > > Out: > > > > > > + if (unlikely(smu->smu_debug_mode) && res) { > > > > > > + amdgpu_device_halt(smu->adev); > > > > > > + BUG(); > > > > > > + } > > > > > > + > > > > > > mutex_unlock(&smu->message_lock); > > > > > > return res; > > > > > > } > > > > > > -- > > > > > > 2.25.1 >