RE: [PATCH] drm/amdgpu: add support to SMU debug option

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[AMD Official Use Only]



> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Yu,
> Lang
> Sent: Wednesday, December 1, 2021 7:37 PM
> To: Koenig, Christian <Christian.Koenig@xxxxxxx>; Christian König
> <ckoenig.leichtzumerken@xxxxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Lazar, Lijo
> <Lijo.Lazar@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx>
> Subject: RE: [PATCH] drm/amdgpu: add support to SMU debug option
> 
> [AMD Official Use Only]
> 
> 
> 
> >-----Original Message-----
> >From: Koenig, Christian <Christian.Koenig@xxxxxxx>
> >Sent: Wednesday, December 1, 2021 7:29 PM
> >To: Yu, Lang <Lang.Yu@xxxxxxx>; Christian König
> ><ckoenig.leichtzumerken@xxxxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Lazar, Lijo
> ><Lijo.Lazar@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx>
> >Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
> >
> >Am 01.12.21 um 12:20 schrieb Yu, Lang:
> >> [AMD Official Use Only]
> >>
> >>> -----Original Message-----
> >>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
> >>> Sent: Wednesday, December 1, 2021 6:49 PM
> >>> To: Yu, Lang <Lang.Yu@xxxxxxx>; Koenig, Christian
> >>> <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Lazar, Lijo
> >>> <Lijo.Lazar@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx>
> >>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
> >>>
> >>> Am 01.12.21 um 11:44 schrieb Yu, Lang:
> >>>> [AMD Official Use Only]
> >>>>
> >>>>
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
> >>>>> Sent: Wednesday, December 1, 2021 5:30 PM
> >>>>> To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >>>>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Lazar, Lijo
> >>>>> <Lijo.Lazar@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx>
> >>>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
> >>>>>
> >>>>> Am 01.12.21 um 10:24 schrieb Lang Yu:
> >>>>>> To maintain system error state when SMU errors occurred, which
> >>>>>> will aid in debugging SMU firmware issues, add SMU debug option
> support.
> >>>>>>
> >>>>>> It can be enabled or disabled via amdgpu_smu_debug debugfs file.
> >>>>>> When enabled, it makes SMU errors fatal.
> >>>>>> It is disabled by default.
> >>>>>>
> >>>>>> == Command Guide ==
> >>>>>>
> >>>>>> 1, enable SMU debug option
> >>>>>>
> >>>>>>     # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> >>>>>>
> >>>>>> 2, disable SMU debug option
> >>>>>>
> >>>>>>     # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> >>>>>>
> >>>>>> v3:
> >>>>>>     - Use debugfs_create_bool().(Christian)
> >>>>>>     - Put variable into smu_context struct.
> >>>>>>     - Don't resend command when timeout.
> >>>>>>
> >>>>>> v2:
> >>>>>>     - Resend command when timeout.(Lijo)
> >>>>>>     - Use debugfs file instead of module parameter.
> >>>>>>
> >>>>>> Signed-off-by: Lang Yu <lang.yu@xxxxxxx>
> >>>>> Well the debugfs part looks really nice and clean now, but one
> >>>>> more comment below.
> >>>>>
> >>>>>> ---
> >>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c        | 3 +++
> >>>>>>     drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h            | 5 +++++
> >>>>>>     drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
> >>>>>>     drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c             | 8 +++++++-
> >>>>>>     4 files changed, 17 insertions(+), 1 deletion(-)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> >>>>>> index 164d6a9e9fbb..86cd888c7822 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> >>>>>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct
> >>>>>> amdgpu_device
> >>>>> *adev)
> >>>>>>     	if (!debugfs_initialized())
> >>>>>>     		return 0;
> >>>>>>
> >>>>>> +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
> >>>>>> +				  &adev->smu.smu_debug_mode);
> >>>>>> +
> >>>>>>     	ent = debugfs_create_file("amdgpu_preempt_ib", 0600,
> root,
> >adev,
> >>>>>>     				  &fops_ib_preempt);
> >>>>>>     	if (IS_ERR(ent)) {
> >>>>>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> >>>>>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> >>>>>> index f738f7dc20c9..50dbf5594a9d 100644
> >>>>>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> >>>>>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> >>>>>> @@ -569,6 +569,11 @@ struct smu_context
> >>>>>>     	struct smu_user_dpm_profile user_dpm_profile;
> >>>>>>
> >>>>>>     	struct stb_context stb_context;
> >>>>>> +	/*
> >>>>>> +	 * When enabled, it makes SMU errors fatal.
> >>>>>> +	 * (0 = disabled (default), 1 = enabled)
> >>>>>> +	 */
> >>>>>> +	bool smu_debug_mode;
> >>>>>>     };
> >>>>>>
> >>>>>>     struct i2c_adapter;
> >>>>>> diff --git
> a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> >>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> >>>>>> index 6e781cee8bb6..d3797a2d6451 100644
> >>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> >>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> >>>>>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct
> >>>>> smu_context *smu)
> >>>>>>     out:
> >>>>>>     	mutex_unlock(&smu->message_lock);
> >>>>>>
> >>>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && ret);
> >>>>>> +
> >>>>>>     	return ret;
> >>>>>>     }
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> >>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> >>>>>> index 048ca1673863..9be005eb4241 100644
> >>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> >>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> >>>>>> @@ -349,15 +349,21 @@ int
> >smu_cmn_send_smc_msg_with_param(struct
> >>>>> smu_context *smu,
> >>>>>>     		__smu_cmn_reg_print_error(smu, reg, index, param,
> >msg);
> >>>>>>     		goto Out;
> >>>>>>     	}
> >>>>>> +
> >>>>>>     	__smu_cmn_send_msg(smu, (uint16_t) index, param);
> >>>>>>     	reg = __smu_cmn_poll_stat(smu);
> >>>>>>     	res = __smu_cmn_reg2errno(smu, reg);
> >>>>>> -	if (res != 0)
> >>>>>> +	if (res != 0) {
> >>>>>>     		__smu_cmn_reg_print_error(smu, reg, index, param,
> >msg);
> >>>>>> +		goto Out;
> >>>>>> +	}
> >>>>>>     	if (read_arg)
> >>>>>>     		smu_cmn_read_arg(smu, read_arg);
> >>>>>>     Out:
> >>>>>>     	mutex_unlock(&smu->message_lock);
> >>>>>> +
> >>>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && res);
> >>>>> BUG_ON() really crashes the kernel and is only allowed if we
> >>>>> prevent further data corruption with that.
> >>>>>
> >>>>> Most of the time WARN_ON() is more appropriate, but I can't fully
> >>>>> judge here since I don't know the SMU code well enough.
> >>>> This is what SMU FW guys want. They want "user-visible (potentially
> >>>> fatal)
> >>> errors", then a hang.
> >>>> They want to keep system state since the error occurred.
> >>> Well that is rather problematic.
> >>>
> >>> First of all we need to really justify that, crashing the kernel is
> >>> not something easily done.
> >>>
> >>> Then this isn't really effective here. What happens is that you
> >>> crash the kernel thread of the currently executing process, but it
> >>> is perfectly possible that another thread still tries to send
> >>> messages to the SMU. You need to have the BUG_ON() before dropping
> >>> the lock to make sure that this really gets the driver stuck in the current
> state.
> >> Thanks. I got it. I just thought it is a kenel panic.
> >> Could we use a panic() here?
> >
> >Potentially, but that might reboot the system automatically which is
> >probably not what you want either.
> >
> >How does the SMU firmware team gather the necessary information when
> a
> >problem occurs?
> 
> As far as I know, they usually use a HDT to collect information.
> And they request a hang when error occurred in ticket.
> "Suggested error responses include pop-up windows (by x86 driver, if this is
> possible) or simply hanging after logging the error. "
[Quan, Evan] Maybe what they want is just a stable SMU state(like no more message issuing after failure).
If that's true, I think you can just bail out on __smu_cmn_poll_stat() failure(in smu_cmn_send_smc_msg_with_param() and smu_cmn_send_msg_without_waiting()).
That will prevent further message issuing to SMU.

BR
Evan
> 
> Regards,
> Lang
> 
> >
> >When they connect external hardware a BUG_ON() while holding the SMU
> >lock might work, but if they want to SSH into the system you should
> >probably rather set a flag that the hardware shouldn't be touched any
> >more by the driver and continue.
> >
> >Otherwise the box is most likely really unstable after this.
> >
> >Regards,
> >Christian.
> >
> >>
> >> Regards,
> >> Lang
> >>
> >>> Regards,
> >>> Christian.
> >>>
> >>>> Regards,
> >>>> Lang
> >>>>
> >>>>> Christian.
> >>>>>
> >>>>>> +
> >>>>>>     	return res;
> >>>>>>     }
> >>>>>>




[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux