RE: [PATCH] drm/amdgpu: add SMU debug option support

"Yu, Lang" <Lang.Yu@xxxxxxx> · Wed, 1 Dec 2021 07:57:57 +0000

[AMD Official Use Only]

>-----Original Message-----
>From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
>Sent: Wednesday, December 1, 2021 3:28 PM
>To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Huang, Ray
><Ray.Huang@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
>Subject: Re: [PATCH] drm/amdgpu: add SMU debug option support
>
>
>
>On 12/1/2021 12:37 PM, Yu, Lang wrote:
>> [AMD Official Use Only]
>>
>>
>>
>>> -----Original Message-----
>>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
>>> Sent: Wednesday, December 1, 2021 2:56 PM
>>> To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Huang, Ray
>>> <Ray.Huang@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
>>> Subject: Re: [PATCH] drm/amdgpu: add SMU debug option support
>>>
>>>
>>>
>>> On 12/1/2021 11:57 AM, Yu, Lang wrote:
>>>> [AMD Official Use Only]
>>>>
>>>> Hi Lijo,
>>>>
>>>> Thanks for your comments.
>>>>
>>>>   From my understanding, that just increases the timeout threshold
>>>> and could hide some potential issues which should be exposed and solved.
>>>>
>>>> If current timeout threshold is not enough for some corner cases,
>>>> (1) Do we consider to increase the threshold to cover these cases?
>>>> (2) Or do we just expose them and request SMU FW to optimize them?
>>>>
>>>> I think it doesn't make much sense to increase the threshold in debug mode.
>>>> How do you think? Thanks!
>>>
>>> In normal cases, 2secs would be more than enough. If we hang
>>> immediately, then check the FW registers later, the response would
>>> have come. I thought we just need to note those cases and not to fail
>>> everytime. Just to mark as a red flag in the log to tell us that FW
>>> is unexpectedly busy processing something else when the message is sent.
>>>
>>> There are some issues related to S0ix where we see the FW comes back
>>> with a response with an increased timeout under certain conditions.
>>
>> If these issues still exists, could we just blacklist the tests that
>> triggered them before solve them? Or we just increase the threshold to cover
>all the cases?
>>
>
>Actually, the timeout is message specific - like i2c transfer from EEPROM could
>take longer time.
>
>I am not sure if we should have more than 2s as timeout. Whenever this kind of
>issue happens, FW team check registers (then it will have a proper value) and say
>they don't see anything abnormal :) Usually, those are just signs of crack and it
>eventually breaks.
>
>Option is just fail immediately (then again not sure useful it will be if the issue is
>this sort of thing) or wait to see how far it goes with an added timeout before it
>fails eventually.

Does smu_cmn_wait_for_response()/smu_cmn_send_msg_without_waiting() are 
designed for long timeout cases? Is it fine that we don't fail here in the event of timeout?

Thanks,
Lang 

>
>Thanks,
>Lijo
>
>> Regards,
>> Lang
>>
>>>
>>> Thanks,
>>> Lijo
>>>
>>>>
>>>> Regards,
>>>> Lang
>>>>
>>>>> -----Original Message-----
>>>>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
>>>>> Sent: Wednesday, December 1, 2021 1:44 PM
>>>>> To: Lazar, Lijo <Lijo.Lazar@xxxxxxx>; Yu, Lang <Lang.Yu@xxxxxxx>;
>>>>> amd- gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Huang, Ray
>>>>> <Ray.Huang@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>> Subject: RE: [PATCH] drm/amdgpu: add SMU debug option support
>>>>>
>>>>> Just realized that the patch I pasted won't work. Outer loop exit
>>>>> needs to be like this.
>>>>> 	(reg & MP1_C2PMSG_90__CONTENT_MASK) != 0 && extended_wait-- >=
>>>>> 0
>>>>>
>>>>> Anyway, that patch is only there to communicate what I really meant
>>>>> in the earlier comment.
>>>>>
>>>>> Thanks,
>>>>> Lijo
>>>>>
>>>>> -----Original Message-----
>>>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of
>>>>> Lazar, Lijo
>>>>> Sent: Wednesday, December 1, 2021 10:44 AM
>>>>> To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Huang, Ray
>>>>> <Ray.Huang@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>> Subject: Re: [PATCH] drm/amdgpu: add SMU debug option support
>>>>>
>>>>>
>>>>>
>>>>> On 11/30/2021 10:47 AM, Lang Yu wrote:
>>>>>> To maintain system error state when SMU errors occurred, which
>>>>>> will aid in debugging SMU firmware issues, add SMU debug option support.
>>>>>>
>>>>>> It can be enabled or disabled via amdgpu_smu_debug debugfs file.
>>>>>> When enabled, it makes SMU errors fatal.
>>>>>> It is disabled by default.
>>>>>>
>>>>>> == Command Guide ==
>>>>>>
>>>>>> 1, enable SMU debug option
>>>>>>
>>>>>>     # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>>>
>>>>>> 2, disable SMU debug option
>>>>>>
>>>>>>     # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>>>
>>>>>> v2:
>>>>>>     - Resend command when timeout.(Lijo)
>>>>>>     - Use debugfs file instead of module parameter.
>>>>>>
>>>>>> Signed-off-by: Lang Yu <lang.yu@xxxxxxx>
>>>>>> ---
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 32
>>> +++++++++++++++++
>>>>>>     drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c      | 39
>>> +++++++++++++++++++-
>>>>> -
>>>>>>     2 files changed, 69 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>> index 164d6a9e9fbb..f9412de86599 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>> @@ -39,6 +39,8 @@
>>>>>>
>>>>>>     #if defined(CONFIG_DEBUG_FS)
>>>>>>
>>>>>> +extern int amdgpu_smu_debug;
>>>>>> +
>>>>>>     /**
>>>>>>      * amdgpu_debugfs_process_reg_op - Handle MMIO register
>reads/writes
>>>>>>      *
>>>>>> @@ -1152,6 +1154,8 @@ static ssize_t
>>>>>> amdgpu_debugfs_gfxoff_read(struct
>>>>> file *f, char __user *buf,
>>>>>>     	return result;
>>>>>>     }
>>>>>>
>>>>>> +
>>>>>> +
>>>>>>     static const struct file_operations amdgpu_debugfs_regs2_fops = {
>>>>>>     	.owner = THIS_MODULE,
>>>>>>     	.unlocked_ioctl = amdgpu_debugfs_regs2_ioctl, @@ -1609,6
>>>>>> +1613,26 @@ DEFINE_DEBUGFS_ATTRIBUTE(fops_ib_preempt, NULL,
>>>>>>     DEFINE_DEBUGFS_ATTRIBUTE(fops_sclk_set, NULL,
>>>>>>     			amdgpu_debugfs_sclk_set, "%llu\n");
>>>>>>
>>>>>> +static int amdgpu_debugfs_smu_debug_get(void *data, u64 *val) {
>>>>>> +	*val = amdgpu_smu_debug;
>>>>>> +	return 0;
>>>>>> +}
>>>>>> +
>>>>>> +static int amdgpu_debugfs_smu_debug_set(void *data, u64 val) {
>>>>>> +	if (val != 0 && val != 1)
>>>>>> +		return -EINVAL;
>>>>>> +
>>>>>> +	amdgpu_smu_debug = val;
>>>>>> +	return 0;
>>>>>> +}
>>>>>> +
>>>>>> +DEFINE_DEBUGFS_ATTRIBUTE(fops_smu_debug,
>>>>>> +			 amdgpu_debugfs_smu_debug_get,
>>>>>> +			 amdgpu_debugfs_smu_debug_set,
>>>>>> +			 "%llu\n");
>>>>>> +
>>>>>>     int amdgpu_debugfs_init(struct amdgpu_device *adev)
>>>>>>     {
>>>>>>     	struct dentry *root =
>>>>>> adev_to_drm(adev)->primary->debugfs_root;
>>>>>> @@ -1632,6 +1656,14 @@ int amdgpu_debugfs_init(struct
>>>>>> amdgpu_device
>>>>> *adev)
>>>>>>     		return PTR_ERR(ent);
>>>>>>     	}
>>>>>>
>>>>>> +	ent = debugfs_create_file("amdgpu_smu_debug", 0600, root, adev,
>>>>>> +				  &fops_smu_debug);
>>>>>> +	if (IS_ERR(ent)) {
>>>>>> +		DRM_ERROR("unable to create amdgpu_smu_debug debugsfs
>>>>> file\n");
>>>>>> +		return PTR_ERR(ent);
>>>>>> +	}
>>>>>> +
>>>>>> +
>>>>>>     	/* Register debugfs entries for amdgpu_ttm */
>>>>>>     	amdgpu_ttm_debugfs_init(adev);
>>>>>>     	amdgpu_debugfs_pm_init(adev); diff --git
>>>>>> a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>> index 048ca1673863..b3969d7933d3 100644
>>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>> @@ -55,6 +55,14 @@
>>>>>>
>>>>>>     #undef __SMU_DUMMY_MAP
>>>>>>     #define __SMU_DUMMY_MAP(type)	#type
>>>>>> +
>>>>>> +/*
>>>>>> + * Used to enable SMU debug option. When enabled, it makes SMU
>>>>>> +errors
>>>>> fatal.
>>>>>> + * This will aid in debugging SMU firmware issues.
>>>>>> + * (0 = disabled (default), 1 = enabled)  */ int
>>>>>> + amdgpu_smu_debug;
>>>>>> +
>>>>>>     static const char * const __smu_message_names[] = {
>>>>>>     	SMU_MESSAGE_TYPES
>>>>>>     };
>>>>>> @@ -272,6 +280,11 @@ int smu_cmn_send_msg_without_waiting(struct
>>>>> smu_context *smu,
>>>>>>     	__smu_cmn_send_msg(smu, msg_index, param);
>>>>>>     	res = 0;
>>>>>>     Out:
>>>>>> +	if (unlikely(amdgpu_smu_debug == 1) && res) {
>>>>>> +		mutex_unlock(&smu->message_lock);
>>>>>> +		BUG();
>>>>>> +	}
>>>>>> +
>>>>>>     	return res;
>>>>>>     }
>>>>>>
>>>>>> @@ -288,9 +301,17 @@ int smu_cmn_send_msg_without_waiting(struct
>>>>> smu_context *smu,
>>>>>>     int smu_cmn_wait_for_response(struct smu_context *smu)
>>>>>>     {
>>>>>>     	u32 reg;
>>>>>> +	int res;
>>>>>>
>>>>>>     	reg = __smu_cmn_poll_stat(smu);
>>>>>> -	return __smu_cmn_reg2errno(smu, reg);
>>>>>> +	res = __smu_cmn_reg2errno(smu, reg);
>>>>>> +
>>>>>> +	if (unlikely(amdgpu_smu_debug == 1) && res) {
>>>>>> +		mutex_unlock(&smu->message_lock);
>>>>>> +		BUG();
>>>>>> +	}
>>>>>> +
>>>>>> +	return res;
>>>>>>     }
>>>>>>
>>>>>>     /**
>>>>>> @@ -328,6 +349,7 @@ int smu_cmn_send_smc_msg_with_param(struct
>>>>> smu_context *smu,
>>>>>>     				    uint32_t param,
>>>>>>     				    uint32_t *read_arg)
>>>>>>     {
>>>>>> +	int retry_count = 0;
>>>>>>     	int res, index;
>>>>>>     	u32 reg;
>>>>>>
>>>>>> @@ -349,15 +371,28 @@ int
>smu_cmn_send_smc_msg_with_param(struct
>>>>> smu_context *smu,
>>>>>>     		__smu_cmn_reg_print_error(smu, reg, index, param,
>msg);
>>>>>>     		goto Out;
>>>>>>     	}
>>>>>> +retry:
>>>>>>     	__smu_cmn_send_msg(smu, (uint16_t) index, param);
>>>>>>     	reg = __smu_cmn_poll_stat(smu);
>>>>>>     	res = __smu_cmn_reg2errno(smu, reg);
>>>>>> -	if (res != 0)
>>>>>> +	if (res != 0) {
>>>>>>     		__smu_cmn_reg_print_error(smu, reg, index, param,
>msg);
>>>>>> +		if ((res == -ETIME) && (retry_count++ < 1)) {
>>>>>> +			usleep_range(500, 1000);
>>>>>> +			dev_err(smu->adev->dev,
>>>>>> +				"SMU: resend command: index:%d
>>>>> param:0x%08X message:%s",
>>>>>> +				index, param, smu_get_message_name(smu,
>>>>> msg));
>>>>>> +			goto retry;
>>>>>> +		}
>>>>>> +		goto Out;
>>>>>> +	}
>>>>>
>>>>> Sorry, what I meant is to have an extended wait time in debug mode.
>>>>> Something like below, not a 'full retry' as in sending the message again.
>>>>>
>>>>>
>>>>> +#define MAX_DBG_WAIT_CNT 3
>>>>> +
>>>>>    /**
>>>>>     * __smu_cmn_poll_stat -- poll for a status from the SMU
>>>>>     * smu: a pointer to SMU context @@ -115,17 +117,24 @@ static
>>>>> void smu_cmn_read_arg(struct smu_context *smu,
>>>>>    static u32 __smu_cmn_poll_stat(struct smu_context *smu)
>>>>>    {
>>>>>           struct amdgpu_device *adev = smu->adev;
>>>>> -       int timeout = adev->usec_timeout * 20;
>>>>> +       int timeout;
>>>>>           u32 reg;
>>>>> +       int extended_wait = smu_debug_mode ? MAX_DBG_WAIT_CNT : 0;
>>>>>
>>>>> -       for ( ; timeout > 0; timeout--) {
>>>>> -               reg = RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90);
>>>>> -               if ((reg & MP1_C2PMSG_90__CONTENT_MASK) != 0)
>>>>> -                       break;
>>>>> +       do {
>>>>> +               timeout = adev->usec_timeout * 20;
>>>>> +               for (; timeout > 0; timeout--) {
>>>>> +                       reg = RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90);
>>>>> +                       if ((reg & MP1_C2PMSG_90__CONTENT_MASK) != 0)
>>>>> +                               break;
>>>>>
>>>>> -               udelay(1);
>>>>> -       }
>>>>> +                       udelay(1);
>>>>> +               }
>>>>> +       } while (extended_wait-- >= 0);
>>>>>
>>>>> +       if (extended_wait != MAX_DBG_WAIT_CNT && reg !=
>SMU_RESP_NONE)
>>>>> +               dev_err(adev->dev,
>>>>> +                       "SMU: Unexpected extended wait for
>>>>> + response");
>>>>>           return reg;
>>>>>    }
>>>>>
>>>>> Thanks,
>>>>> Lijo
>>>>>
>>>>>>     	if (read_arg)
>>>>>>     		smu_cmn_read_arg(smu, read_arg);
>>>>>>     Out:
>>>>>>     	mutex_unlock(&smu->message_lock);
>>>>>> +
>>>>>> +	BUG_ON(unlikely(amdgpu_smu_debug == 1) && res);
>>>>>> +
>>>>>>     	return res;
>>>>>>     }
>>>>>>
>>>>>>