RE: [PATCH] drm/amdgpu: add support to SMU debug option

"Yu, Lang" <Lang.Yu@xxxxxxx> · Wed, 1 Dec 2021 11:37:24 +0000

[AMD Official Use Only]

>-----Original Message-----
>From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>Sent: Wednesday, December 1, 2021 7:29 PM
>To: Yu, Lang <Lang.Yu@xxxxxxx>; Christian König
><ckoenig.leichtzumerken@xxxxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Lazar, Lijo
><Lijo.Lazar@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx>
>Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>
>Am 01.12.21 um 12:20 schrieb Yu, Lang:
>> [AMD Official Use Only]
>>
>>> -----Original Message-----
>>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
>>> Sent: Wednesday, December 1, 2021 6:49 PM
>>> To: Yu, Lang <Lang.Yu@xxxxxxx>; Koenig, Christian
>>> <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Lazar, Lijo
>>> <Lijo.Lazar@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx>
>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>>>
>>> Am 01.12.21 um 11:44 schrieb Yu, Lang:
>>>> [AMD Official Use Only]
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>> Sent: Wednesday, December 1, 2021 5:30 PM
>>>>> To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Lazar, Lijo
>>>>> <Lijo.Lazar@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx>
>>>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>>>>>
>>>>> Am 01.12.21 um 10:24 schrieb Lang Yu:
>>>>>> To maintain system error state when SMU errors occurred, which
>>>>>> will aid in debugging SMU firmware issues, add SMU debug option support.
>>>>>>
>>>>>> It can be enabled or disabled via amdgpu_smu_debug debugfs file.
>>>>>> When enabled, it makes SMU errors fatal.
>>>>>> It is disabled by default.
>>>>>>
>>>>>> == Command Guide ==
>>>>>>
>>>>>> 1, enable SMU debug option
>>>>>>
>>>>>>     # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>>>
>>>>>> 2, disable SMU debug option
>>>>>>
>>>>>>     # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>>>
>>>>>> v3:
>>>>>>     - Use debugfs_create_bool().(Christian)
>>>>>>     - Put variable into smu_context struct.
>>>>>>     - Don't resend command when timeout.
>>>>>>
>>>>>> v2:
>>>>>>     - Resend command when timeout.(Lijo)
>>>>>>     - Use debugfs file instead of module parameter.
>>>>>>
>>>>>> Signed-off-by: Lang Yu <lang.yu@xxxxxxx>
>>>>> Well the debugfs part looks really nice and clean now, but one more
>>>>> comment below.
>>>>>
>>>>>> ---
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c        | 3 +++
>>>>>>     drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h            | 5 +++++
>>>>>>     drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
>>>>>>     drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c             | 8 +++++++-
>>>>>>     4 files changed, 17 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>> index 164d6a9e9fbb..86cd888c7822 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device
>>>>> *adev)
>>>>>>     	if (!debugfs_initialized())
>>>>>>     		return 0;
>>>>>>
>>>>>> +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
>>>>>> +				  &adev->smu.smu_debug_mode);
>>>>>> +
>>>>>>     	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root,
>adev,
>>>>>>     				  &fops_ib_preempt);
>>>>>>     	if (IS_ERR(ent)) {
>>>>>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>>> index f738f7dc20c9..50dbf5594a9d 100644
>>>>>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>>> @@ -569,6 +569,11 @@ struct smu_context
>>>>>>     	struct smu_user_dpm_profile user_dpm_profile;
>>>>>>
>>>>>>     	struct stb_context stb_context;
>>>>>> +	/*
>>>>>> +	 * When enabled, it makes SMU errors fatal.
>>>>>> +	 * (0 = disabled (default), 1 = enabled)
>>>>>> +	 */
>>>>>> +	bool smu_debug_mode;
>>>>>>     };
>>>>>>
>>>>>>     struct i2c_adapter;
>>>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>>> index 6e781cee8bb6..d3797a2d6451 100644
>>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct
>>>>> smu_context *smu)
>>>>>>     out:
>>>>>>     	mutex_unlock(&smu->message_lock);
>>>>>>
>>>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && ret);
>>>>>> +
>>>>>>     	return ret;
>>>>>>     }
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>> index 048ca1673863..9be005eb4241 100644
>>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>> @@ -349,15 +349,21 @@ int
>smu_cmn_send_smc_msg_with_param(struct
>>>>> smu_context *smu,
>>>>>>     		__smu_cmn_reg_print_error(smu, reg, index, param,
>msg);
>>>>>>     		goto Out;
>>>>>>     	}
>>>>>> +
>>>>>>     	__smu_cmn_send_msg(smu, (uint16_t) index, param);
>>>>>>     	reg = __smu_cmn_poll_stat(smu);
>>>>>>     	res = __smu_cmn_reg2errno(smu, reg);
>>>>>> -	if (res != 0)
>>>>>> +	if (res != 0) {
>>>>>>     		__smu_cmn_reg_print_error(smu, reg, index, param,
>msg);
>>>>>> +		goto Out;
>>>>>> +	}
>>>>>>     	if (read_arg)
>>>>>>     		smu_cmn_read_arg(smu, read_arg);
>>>>>>     Out:
>>>>>>     	mutex_unlock(&smu->message_lock);
>>>>>> +
>>>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && res);
>>>>> BUG_ON() really crashes the kernel and is only allowed if we
>>>>> prevent further data corruption with that.
>>>>>
>>>>> Most of the time WARN_ON() is more appropriate, but I can't fully
>>>>> judge here since I don't know the SMU code well enough.
>>>> This is what SMU FW guys want. They want "user-visible (potentially
>>>> fatal)
>>> errors", then a hang.
>>>> They want to keep system state since the error occurred.
>>> Well that is rather problematic.
>>>
>>> First of all we need to really justify that, crashing the kernel is
>>> not something easily done.
>>>
>>> Then this isn't really effective here. What happens is that you crash
>>> the kernel thread of the currently executing process, but it is
>>> perfectly possible that another thread still tries to send messages
>>> to the SMU. You need to have the BUG_ON() before dropping the lock to
>>> make sure that this really gets the driver stuck in the current state.
>> Thanks. I got it. I just thought it is a kenel panic.
>> Could we use a panic() here?
>
>Potentially, but that might reboot the system automatically which is probably not
>what you want either.
>
>How does the SMU firmware team gather the necessary information when a
>problem occurs?

As far as I know, they usually use a HDT to collect information.
And they request a hang when error occurred in ticket.
"Suggested error responses include pop-up windows (by x86 driver, if this is possible) or simply hanging after logging the error. "

Regards,
Lang

>
>When they connect external hardware a BUG_ON() while holding the SMU lock
>might work, but if they want to SSH into the system you should probably rather
>set a flag that the hardware shouldn't be touched any more by the driver and
>continue.
>
>Otherwise the box is most likely really unstable after this.
>
>Regards,
>Christian.
>
>>
>> Regards,
>> Lang
>>
>>> Regards,
>>> Christian.
>>>
>>>> Regards,
>>>> Lang
>>>>
>>>>> Christian.
>>>>>
>>>>>> +
>>>>>>     	return res;
>>>>>>     }
>>>>>>