[AMD Official Use Only] >-----Original Message----- >From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Yu, Lang >Sent: Wednesday, December 1, 2021 3:58 PM >To: Lazar, Lijo <Lijo.Lazar@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Huang, Ray ><Ray.Huang@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx> >Subject: RE: [PATCH] drm/amdgpu: add SMU debug option support > >[AMD Official Use Only] > > > >>-----Original Message----- >>From: Lazar, Lijo <Lijo.Lazar@xxxxxxx> >>Sent: Wednesday, December 1, 2021 3:28 PM >>To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Huang, Ray >><Ray.Huang@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx> >>Subject: Re: [PATCH] drm/amdgpu: add SMU debug option support >> >> >> >>On 12/1/2021 12:37 PM, Yu, Lang wrote: >>> [AMD Official Use Only] >>> >>> >>> >>>> -----Original Message----- >>>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx> >>>> Sent: Wednesday, December 1, 2021 2:56 PM >>>> To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Huang, Ray >>>> <Ray.Huang@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx> >>>> Subject: Re: [PATCH] drm/amdgpu: add SMU debug option support >>>> >>>> >>>> >>>> On 12/1/2021 11:57 AM, Yu, Lang wrote: >>>>> [AMD Official Use Only] >>>>> >>>>> Hi Lijo, >>>>> >>>>> Thanks for your comments. >>>>> >>>>> From my understanding, that just increases the timeout threshold >>>>> and could hide some potential issues which should be exposed and solved. >>>>> >>>>> If current timeout threshold is not enough for some corner cases, >>>>> (1) Do we consider to increase the threshold to cover these cases? >>>>> (2) Or do we just expose them and request SMU FW to optimize them? >>>>> >>>>> I think it doesn't make much sense to increase the threshold in debug mode. >>>>> How do you think? Thanks! >>>> >>>> In normal cases, 2secs would be more than enough. If we hang >>>> immediately, then check the FW registers later, the response would >>>> have come. I thought we just need to note those cases and not to >>>> fail everytime. Just to mark as a red flag in the log to tell us >>>> that FW is unexpectedly busy processing something else when the message is >sent. >>>> >>>> There are some issues related to S0ix where we see the FW comes back >>>> with a response with an increased timeout under certain conditions. >>> >>> If these issues still exists, could we just blacklist the tests that >>> triggered them before solve them? Or we just increase the threshold >>> to cover >>all the cases? >>> >> >>Actually, the timeout is message specific - like i2c transfer from >>EEPROM could take longer time. >> >>I am not sure if we should have more than 2s as timeout. Whenever this >>kind of issue happens, FW team check registers (then it will have a >>proper value) and say they don't see anything abnormal :) Usually, >>those are just signs of crack and it eventually breaks. >> >>Option is just fail immediately (then again not sure useful it will be >>if the issue is this sort of thing) or wait to see how far it goes with >>an added timeout before it fails eventually. > >Are smu_cmn_wait_for_response()/smu_cmn_send_msg_without_waiting() >designed for long timeout cases? Is it fine that we don't fail here in the event of >timeout? Or we can add a timeout parameter into smu_cmn_send_smc_msg_with_param() to specify the timeout you want for specific message. I think this may be another story. Thanks! Thanks, Lang > >> >>Thanks, >>Lijo >> >>> Regards, >>> Lang >>> >>>> >>>> Thanks, >>>> Lijo >>>> >>>>> >>>>> Regards, >>>>> Lang >>>>> >>>>>> -----Original Message----- >>>>>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx> >>>>>> Sent: Wednesday, December 1, 2021 1:44 PM >>>>>> To: Lazar, Lijo <Lijo.Lazar@xxxxxxx>; Yu, Lang <Lang.Yu@xxxxxxx>; >>>>>> amd- gfx@xxxxxxxxxxxxxxxxxxxxx >>>>>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Huang, Ray >>>>>> <Ray.Huang@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx> >>>>>> Subject: RE: [PATCH] drm/amdgpu: add SMU debug option support >>>>>> >>>>>> Just realized that the patch I pasted won't work. Outer loop exit >>>>>> needs to be like this. >>>>>> (reg & MP1_C2PMSG_90__CONTENT_MASK) != 0 && extended_wait-- >= >>>>>> 0 >>>>>> >>>>>> Anyway, that patch is only there to communicate what I really >>>>>> meant in the earlier comment. >>>>>> >>>>>> Thanks, >>>>>> Lijo >>>>>> >>>>>> -----Original Message----- >>>>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of >>>>>> Lazar, Lijo >>>>>> Sent: Wednesday, December 1, 2021 10:44 AM >>>>>> To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>>>>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Huang, Ray >>>>>> <Ray.Huang@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx> >>>>>> Subject: Re: [PATCH] drm/amdgpu: add SMU debug option support >>>>>> >>>>>> >>>>>> >>>>>> On 11/30/2021 10:47 AM, Lang Yu wrote: >>>>>>> To maintain system error state when SMU errors occurred, which >>>>>>> will aid in debugging SMU firmware issues, add SMU debug option >support. >>>>>>> >>>>>>> It can be enabled or disabled via amdgpu_smu_debug debugfs file. >>>>>>> When enabled, it makes SMU errors fatal. >>>>>>> It is disabled by default. >>>>>>> >>>>>>> == Command Guide == >>>>>>> >>>>>>> 1, enable SMU debug option >>>>>>> >>>>>>> # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug >>>>>>> >>>>>>> 2, disable SMU debug option >>>>>>> >>>>>>> # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug >>>>>>> >>>>>>> v2: >>>>>>> - Resend command when timeout.(Lijo) >>>>>>> - Use debugfs file instead of module parameter. >>>>>>> >>>>>>> Signed-off-by: Lang Yu <lang.yu@xxxxxxx> >>>>>>> --- >>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 32 >>>> +++++++++++++++++ >>>>>>> drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 39 >>>> +++++++++++++++++++- >>>>>> - >>>>>>> 2 files changed, 69 insertions(+), 2 deletions(-) >>>>>>> >>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >>>>>>> index 164d6a9e9fbb..f9412de86599 100644 >>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >>>>>>> @@ -39,6 +39,8 @@ >>>>>>> >>>>>>> #if defined(CONFIG_DEBUG_FS) >>>>>>> >>>>>>> +extern int amdgpu_smu_debug; >>>>>>> + >>>>>>> /** >>>>>>> * amdgpu_debugfs_process_reg_op - Handle MMIO register >>reads/writes >>>>>>> * >>>>>>> @@ -1152,6 +1154,8 @@ static ssize_t >>>>>>> amdgpu_debugfs_gfxoff_read(struct >>>>>> file *f, char __user *buf, >>>>>>> return result; >>>>>>> } >>>>>>> >>>>>>> + >>>>>>> + >>>>>>> static const struct file_operations amdgpu_debugfs_regs2_fops = { >>>>>>> .owner = THIS_MODULE, >>>>>>> .unlocked_ioctl = amdgpu_debugfs_regs2_ioctl, @@ -1609,6 >>>>>>> +1613,26 @@ DEFINE_DEBUGFS_ATTRIBUTE(fops_ib_preempt, NULL, >>>>>>> DEFINE_DEBUGFS_ATTRIBUTE(fops_sclk_set, NULL, >>>>>>> amdgpu_debugfs_sclk_set, "%llu\n"); >>>>>>> >>>>>>> +static int amdgpu_debugfs_smu_debug_get(void *data, u64 *val) { >>>>>>> + *val = amdgpu_smu_debug; >>>>>>> + return 0; >>>>>>> +} >>>>>>> + >>>>>>> +static int amdgpu_debugfs_smu_debug_set(void *data, u64 val) { >>>>>>> + if (val != 0 && val != 1) >>>>>>> + return -EINVAL; >>>>>>> + >>>>>>> + amdgpu_smu_debug = val; >>>>>>> + return 0; >>>>>>> +} >>>>>>> + >>>>>>> +DEFINE_DEBUGFS_ATTRIBUTE(fops_smu_debug, >>>>>>> + amdgpu_debugfs_smu_debug_get, >>>>>>> + amdgpu_debugfs_smu_debug_set, >>>>>>> + "%llu\n"); >>>>>>> + >>>>>>> int amdgpu_debugfs_init(struct amdgpu_device *adev) >>>>>>> { >>>>>>> struct dentry *root = >>>>>>> adev_to_drm(adev)->primary->debugfs_root; >>>>>>> @@ -1632,6 +1656,14 @@ int amdgpu_debugfs_init(struct >>>>>>> amdgpu_device >>>>>> *adev) >>>>>>> return PTR_ERR(ent); >>>>>>> } >>>>>>> >>>>>>> + ent = debugfs_create_file("amdgpu_smu_debug", 0600, root, >adev, >>>>>>> + &fops_smu_debug); >>>>>>> + if (IS_ERR(ent)) { >>>>>>> + DRM_ERROR("unable to create amdgpu_smu_debug >debugsfs >>>>>> file\n"); >>>>>>> + return PTR_ERR(ent); >>>>>>> + } >>>>>>> + >>>>>>> + >>>>>>> /* Register debugfs entries for amdgpu_ttm */ >>>>>>> amdgpu_ttm_debugfs_init(adev); >>>>>>> amdgpu_debugfs_pm_init(adev); diff --git >>>>>>> a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c >>>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c >>>>>>> index 048ca1673863..b3969d7933d3 100644 >>>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c >>>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c >>>>>>> @@ -55,6 +55,14 @@ >>>>>>> >>>>>>> #undef __SMU_DUMMY_MAP >>>>>>> #define __SMU_DUMMY_MAP(type) #type >>>>>>> + >>>>>>> +/* >>>>>>> + * Used to enable SMU debug option. When enabled, it makes SMU >>>>>>> +errors >>>>>> fatal. >>>>>>> + * This will aid in debugging SMU firmware issues. >>>>>>> + * (0 = disabled (default), 1 = enabled) */ int >>>>>>> + amdgpu_smu_debug; >>>>>>> + >>>>>>> static const char * const __smu_message_names[] = { >>>>>>> SMU_MESSAGE_TYPES >>>>>>> }; >>>>>>> @@ -272,6 +280,11 @@ int smu_cmn_send_msg_without_waiting(struct >>>>>> smu_context *smu, >>>>>>> __smu_cmn_send_msg(smu, msg_index, param); >>>>>>> res = 0; >>>>>>> Out: >>>>>>> + if (unlikely(amdgpu_smu_debug == 1) && res) { >>>>>>> + mutex_unlock(&smu->message_lock); >>>>>>> + BUG(); >>>>>>> + } >>>>>>> + >>>>>>> return res; >>>>>>> } >>>>>>> >>>>>>> @@ -288,9 +301,17 @@ int smu_cmn_send_msg_without_waiting(struct >>>>>> smu_context *smu, >>>>>>> int smu_cmn_wait_for_response(struct smu_context *smu) >>>>>>> { >>>>>>> u32 reg; >>>>>>> + int res; >>>>>>> >>>>>>> reg = __smu_cmn_poll_stat(smu); >>>>>>> - return __smu_cmn_reg2errno(smu, reg); >>>>>>> + res = __smu_cmn_reg2errno(smu, reg); >>>>>>> + >>>>>>> + if (unlikely(amdgpu_smu_debug == 1) && res) { >>>>>>> + mutex_unlock(&smu->message_lock); >>>>>>> + BUG(); >>>>>>> + } >>>>>>> + >>>>>>> + return res; >>>>>>> } >>>>>>> >>>>>>> /** >>>>>>> @@ -328,6 +349,7 @@ int smu_cmn_send_smc_msg_with_param(struct >>>>>> smu_context *smu, >>>>>>> uint32_t param, >>>>>>> uint32_t *read_arg) >>>>>>> { >>>>>>> + int retry_count = 0; >>>>>>> int res, index; >>>>>>> u32 reg; >>>>>>> >>>>>>> @@ -349,15 +371,28 @@ int >>smu_cmn_send_smc_msg_with_param(struct >>>>>> smu_context *smu, >>>>>>> __smu_cmn_reg_print_error(smu, reg, index, param, >>msg); >>>>>>> goto Out; >>>>>>> } >>>>>>> +retry: >>>>>>> __smu_cmn_send_msg(smu, (uint16_t) index, param); >>>>>>> reg = __smu_cmn_poll_stat(smu); >>>>>>> res = __smu_cmn_reg2errno(smu, reg); >>>>>>> - if (res != 0) >>>>>>> + if (res != 0) { >>>>>>> __smu_cmn_reg_print_error(smu, reg, index, param, >>msg); >>>>>>> + if ((res == -ETIME) && (retry_count++ < 1)) { >>>>>>> + usleep_range(500, 1000); >>>>>>> + dev_err(smu->adev->dev, >>>>>>> + "SMU: resend command: index:%d >>>>>> param:0x%08X message:%s", >>>>>>> + index, param, >smu_get_message_name(smu, >>>>>> msg)); >>>>>>> + goto retry; >>>>>>> + } >>>>>>> + goto Out; >>>>>>> + } >>>>>> >>>>>> Sorry, what I meant is to have an extended wait time in debug mode. >>>>>> Something like below, not a 'full retry' as in sending the message again. >>>>>> >>>>>> >>>>>> +#define MAX_DBG_WAIT_CNT 3 >>>>>> + >>>>>> /** >>>>>> * __smu_cmn_poll_stat -- poll for a status from the SMU >>>>>> * smu: a pointer to SMU context @@ -115,17 +117,24 @@ static >>>>>> void smu_cmn_read_arg(struct smu_context *smu, >>>>>> static u32 __smu_cmn_poll_stat(struct smu_context *smu) >>>>>> { >>>>>> struct amdgpu_device *adev = smu->adev; >>>>>> - int timeout = adev->usec_timeout * 20; >>>>>> + int timeout; >>>>>> u32 reg; >>>>>> + int extended_wait = smu_debug_mode ? MAX_DBG_WAIT_CNT : 0; >>>>>> >>>>>> - for ( ; timeout > 0; timeout--) { >>>>>> - reg = RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90); >>>>>> - if ((reg & MP1_C2PMSG_90__CONTENT_MASK) != 0) >>>>>> - break; >>>>>> + do { >>>>>> + timeout = adev->usec_timeout * 20; >>>>>> + for (; timeout > 0; timeout--) { >>>>>> + reg = RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90); >>>>>> + if ((reg & MP1_C2PMSG_90__CONTENT_MASK) != 0) >>>>>> + break; >>>>>> >>>>>> - udelay(1); >>>>>> - } >>>>>> + udelay(1); >>>>>> + } >>>>>> + } while (extended_wait-- >= 0); >>>>>> >>>>>> + if (extended_wait != MAX_DBG_WAIT_CNT && reg != >>SMU_RESP_NONE) >>>>>> + dev_err(adev->dev, >>>>>> + "SMU: Unexpected extended wait for >>>>>> + response"); >>>>>> return reg; >>>>>> } >>>>>> >>>>>> Thanks, >>>>>> Lijo >>>>>> >>>>>>> if (read_arg) >>>>>>> smu_cmn_read_arg(smu, read_arg); >>>>>>> Out: >>>>>>> mutex_unlock(&smu->message_lock); >>>>>>> + >>>>>>> + BUG_ON(unlikely(amdgpu_smu_debug == 1) && res); >>>>>>> + >>>>>>> return res; >>>>>>> } >>>>>>> >>>>>>>