On 2021-07-14 11:22 a.m., Alex Deucher wrote: > On Tue, Jul 13, 2021 at 10:01 PM Quan, Evan <Evan.Quan@xxxxxxx> wrote: >> [AMD Official Use Only] >> >> >> >>> -----Original Message----- >>> From: Tuikov, Luben <Luben.Tuikov@xxxxxxx> >>> Sent: Wednesday, July 14, 2021 1:36 AM >>> To: Quan, Evan <Evan.Quan@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Alimucaj, Andi >>> <Anduil.Alimucaj@xxxxxxx>; Baluja, Aaron <Aaron.Baluja@xxxxxxx> >>> Subject: Re: [PATCH] drm/amd/pm: Fix a bug communicating with the SMU >>> >>> On 2021-07-13 3:07 a.m., Quan, Evan wrote: >>>> [AMD Official Use Only] >>>> >>>> >>>> >>>>> -----Original Message----- >>>>> From: Tuikov, Luben <Luben.Tuikov@xxxxxxx> >>>>> Sent: Monday, July 12, 2021 11:31 PM >>>>> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>>>> Cc: Tuikov, Luben <Luben.Tuikov@xxxxxxx>; Deucher, Alexander >>>>> <Alexander.Deucher@xxxxxxx>; Quan, Evan <Evan.Quan@xxxxxxx> >>>>> Subject: [PATCH] drm/amd/pm: Fix a bug communicating with the SMU >>>>> >>>>> This fixes a bug which if we probe a non-existing I2C device, and the >>>>> SMU returns 0xFF, >>>> [Quan, Evan] Do we have to probe I2C device via issuing commands to SMU >>> and check existence via SMU response? >>> >>> Yes, yes we do. >>> >>>> Is there other more friendly way? >>> No, there isn't. >>> >>>> >from then on >>>>> we can never communicate with the SMU, because the code before this >>>>> patch reads and interprets 0xFF as a terminal error, and thus we >>>>> never write 0 into register 90 to clear the status (and subsequently >>>>> send a new command to the SMU.) >>>>> >>>>> It is not an error that the SMU returns status 0xFF. This means that >>>>> the SMU executed the last command successfully (execution status), >>>>> but the command result is an error of some sort (execution result), >>>>> depending on what the command was. >>>>> >>>>> When doing a status check of the SMU, before we send a new command, >>>>> the only status which precludes us from sending a new command is >>>>> 0--the SMU hasn't finished executing a previous command, and >>>>> 0xFC--the SMU is busy. >>>>> >>>>> This bug was seen as the following line in the kernel log, >>>>> >>>>> amdgpu: Msg issuing pre-check failed(0xff) and SMU may be not in the >>>>> right state! >>>> [Quan, Evan] This was designed to prevent failure scenario from ruin. >>>> Via this, we can prevent those SMU command/response related registers >>> overwritten. >>>> Then PMFW team can known which command invoked the first error. >>> Sorry, this is not correct. >>> >>> The interface cannot block valid access to the SMU firmware, just because a >>> command executed successfully, but with a failed result, i.e. set a clock >>> frequency to such-and-such frequency was executed successfully by the >>> SMU, but the frequency wasn't able to be set as required--the SMU IS NOT >>> BLOCKED, but can execute more commands. >> [Quan, Evan] >> 1. First of all, if the response from SMU is not 1, it means SMU must detected something wrong. >> We(driver) should properly handle that. I do not think it's a good idea to silently ignore that and proceed more commands. >> 2. Please be noticed that many commands(SMU messages) have dependence with each other. It means even if the further command >> can be executed "successfully", it may be not executed in the expected way. >>> If the PMFW team wants to know which command invoked "the first error", >>> they can check this explicitly, they can call smu_cmn_wait_for_response(), >>> just like the reset code does--see the reset code. >> [Quan, Evan] Per my knowledge gained during co-work with PMFW team, they expect no further driver-smu interaction(driver stops issuing command) >> when something went wrong. As they highly rely on SMU internal statistics for their debugging and further interaction may ruin them. >>> The way commit fcb1fe9c9e0031 modified the code, it blocks access to the >>> SMU for various other users of the SMU, not least of which is the I2C. >> [Quan, Evan] I'm wondering can we just list the I2C case as an exception. I means for the SMU command related with I2C we always assume the response is 1. >> For other commands, we just leave them as they are. > Maybe we need to just bubble this up to the callers and let each one > handle it as appropriate. I don't think just throwing up our hands on > any error is the right thing to do. The i2c case is a good example. > In other cases, we could probably just retry the transaction or just > ignore the error. My thoughts exactly. I'll generate a v3 of the patch to address the issues brought here and elsewhere. Regards, Luben > > Alex > > >> BR >> Evan >>> Regards, >>> Luben >>> >>>> BR >>>> Evan >>>>> when subsequent SMU commands, not necessarily related to I2C, were >>>>> sent to the SMU. >>>>> >>>>> This patch fixes this bug. >>>>> >>>>> Cc: Alex Deucher <Alexander.Deucher@xxxxxxx> >>>>> Cc: Evan Quan <evan.quan@xxxxxxx> >>>>> Fixes: fcb1fe9c9e0031 ("drm/amd/powerplay: pre-check the SMU state >>>>> before issuing message") >>>>> Signed-off-by: Luben Tuikov <luben.tuikov@xxxxxxx> >>>>> --- >>>>> drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 196 >>>>> +++++++++++++++++++------ >>>>> drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h | 3 +- >>>>> 2 files changed, 152 insertions(+), 47 deletions(-) >>>>> >>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c >>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c >>>>> index c902fdf322c1be..775eb50a2e49a6 100644 >>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c >>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c >>>>> @@ -55,7 +55,7 @@ >>>>> >>>>> #undef __SMU_DUMMY_MAP >>>>> #define __SMU_DUMMY_MAP(type) #type >>>>> -static const char* __smu_message_names[] = { >>>>> +static const char * const __smu_message_names[] = { >>>>> SMU_MESSAGE_TYPES >>>>> }; >>>>> >>>>> @@ -76,46 +76,161 @@ static void smu_cmn_read_arg(struct >>> smu_context >>>>> *smu, >>>>> *arg = RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_82); } >>>>> >>>>> -int smu_cmn_wait_for_response(struct smu_context *smu) >>>>> +/** >>>>> + * __smu_cmn_poll_stat -- poll for a status from the SMU >>>>> + * smu: a pointer to SMU context >>>>> + * >>>>> + * Returns the status of the SMU, which could be, >>>>> + * 0, the SMU is busy with your previous command; >>>>> + * 1, execution status: success, execution result: success; >>>>> + * 0xFF, execution status: success, execution result: failure; >>>>> + * 0xFE, unknown command; >>>>> + * 0xFD, valid command, but bad (command) prerequisites; >>>>> + * 0xFC, the command was rejected as the SMU is busy; >>>>> + * 0xFB, "SMC_Result_DebugDataDumpEnd". >>>>> + */ >>>>> +static u32 __smu_cmn_poll_stat(struct smu_context *smu) >>>>> { >>>>> struct amdgpu_device *adev = smu->adev; >>>>> - uint32_t cur_value, i, timeout = adev->usec_timeout * 20; >>>>> + int timeout = adev->usec_timeout * 20; >>>>> + u32 reg; >>>>> >>>>> - for (i = 0; i < timeout; i++) { >>>>> - cur_value = RREG32_SOC15(MP1, 0, >>>>> mmMP1_SMN_C2PMSG_90); >>>>> - if ((cur_value & MP1_C2PMSG_90__CONTENT_MASK) != 0) >>>>> - return cur_value; >>>>> + for ( ; timeout > 0; timeout--) { >>>>> + reg = RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90); >>>>> + if ((reg & MP1_C2PMSG_90__CONTENT_MASK) != 0) >>>>> + break; >>>>> >>>>> udelay(1); >>>>> } >>>>> >>>>> - /* timeout means wrong logic */ >>>>> - if (i == timeout) >>>>> - return -ETIME; >>>>> - >>>>> - return RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90); >>>>> + return reg; >>>>> } >>>>> >>>>> -int smu_cmn_send_msg_without_waiting(struct smu_context *smu, >>>>> - uint16_t msg, uint32_t param) >>>>> +static void __smu_cmn_reg_print_error(struct smu_context *smu, >>>>> + u32 reg_c2pmsg_90, >>>>> + int msg_index, >>>>> + u32 param, >>>>> + enum smu_message_type msg) >>>>> { >>>>> struct amdgpu_device *adev = smu->adev; >>>>> - int ret; >>>>> + const char *message = smu_get_message_name(smu, msg); >>>>> >>>>> - ret = smu_cmn_wait_for_response(smu); >>>>> - if (ret != 0x1) { >>>>> - dev_err(adev->dev, "Msg issuing pre-check failed(0x%x) and >>>>> " >>>>> - "SMU may be not in the right state!\n", ret); >>>>> - if (ret != -ETIME) >>>>> - ret = -EIO; >>>>> - return ret; >>>>> + switch (reg_c2pmsg_90) { >>>>> + case 0: >>>>> + dev_err_ratelimited(adev->dev, >>>>> + "SMU: I'm not done with your previous >>>>> command!"); >>>>> + break; >>>>> + case 1: >>>>> + /* The SMU executed the command. It completed with a >>>>> + * successful result. >>>>> + */ >>>>> + break; >>>>> + case 0xFF: >>>>> + /* The SMU executed the command. It completed with a >>>>> + * unsuccessful result. >>>>> + */ >>>>> + break; >>>>> + case 0xFE: >>>>> + dev_err_ratelimited(adev->dev, >>>>> + "SMU: unknown command: index:%d >>>>> param:0x%08X message:%s", >>>>> + msg_index, param, message); >>>>> + break; >>>>> + case 0xFD: >>>>> + dev_err_ratelimited(adev->dev, >>>>> + "SMU: valid command, bad prerequisites: >>>>> index:%d param:0x%08X message:%s", >>>>> + msg_index, param, message); >>>>> + break; >>>>> + case 0xFC: >>>>> + dev_err_ratelimited(adev->dev, >>>>> + "SMU: I'm very busy for your command: >>>>> index:%d param:0x%08X message:%s", >>>>> + msg_index, param, message); >>>>> + break; >>>>> + case 0xFB: >>>>> + dev_err_ratelimited(adev->dev, >>>>> + "SMU: I'm debugging!"); >>>>> + break; >>>>> + default: >>>>> + dev_err_ratelimited(adev->dev, >>>>> + "SMU: response:0x%08X for index:%d >>>>> param:0x%08X message:%s?", >>>>> + reg_c2pmsg_90, msg_index, param, >>>>> message); >>>>> + break; >>>>> + } >>>>> +} >>>>> + >>>>> +static int __smu_cmn_reg2errno(struct smu_context *smu, u32 >>>>> reg_c2pmsg_90) >>>>> +{ >>>>> + int res; >>>>> + >>>>> + switch (reg_c2pmsg_90) { >>>>> + case 0: >>>>> + res = -ETIME; >>>>> + break; >>>>> + case 1: >>>>> + res = 0; >>>>> + break; >>>>> + case 0xFF: >>>>> + res = -EIO; >>>>> + break; >>>>> + case 0xFE: >>>>> + res = -EOPNOTSUPP; >>>>> + break; >>>>> + case 0xFD: >>>>> + res = -EIO; >>>>> + break; >>>>> + case 0xFC: >>>>> + res = -EBUSY; >>>>> + break; >>>>> + case 0xFB: >>>>> + res = -EIO; >>>>> + break; >>>>> + default: >>>>> + res = -EIO; >>>>> + break; >>>>> } >>>>> >>>>> + return res; >>>>> +} >>>>> + >>>>> +static void __smu_cmn_send_msg(struct smu_context *smu, >>>>> + u16 msg, >>>>> + u32 param) >>>>> +{ >>>>> + struct amdgpu_device *adev = smu->adev; >>>>> + >>>>> WREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90, 0); >>>>> WREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_82, param); >>>>> WREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_66, msg); >>>>> +} >>>>> >>>>> - return 0; >>>>> +int smu_cmn_send_msg_without_waiting(struct smu_context *smu, >>>>> + uint16_t msg_index, >>>>> + uint32_t param) >>>>> +{ >>>>> + u32 reg; >>>>> + int res; >>>>> + >>>>> + if (smu->adev->in_pci_err_recovery) >>>>> + return 0; >>>>> + >>>>> + mutex_lock(&smu->message_lock); >>>>> + reg = __smu_cmn_poll_stat(smu); >>>>> + if (reg == 0 || reg == 0xFC) { >>>>> + res = __smu_cmn_reg2errno(smu, reg); >>>>> + goto Out; >>>>> + } >>>>> + __smu_cmn_send_msg(smu, msg_index, param); >>>>> + res = 0; >>>>> +Out: >>>>> + mutex_unlock(&smu->message_lock); >>>>> + return res; >>>>> +} >>>>> + >>>>> +int smu_cmn_wait_for_response(struct smu_context *smu) { >>>>> + u32 reg; >>>>> + >>>>> + reg = __smu_cmn_poll_stat(smu); >>>>> + return __smu_cmn_reg2errno(smu, reg); >>>>> } >>>>> >>>>> int smu_cmn_send_smc_msg_with_param(struct smu_context *smu, >>> @@ >>>>> -123,8 +238,8 @@ int smu_cmn_send_smc_msg_with_param(struct >>>>> smu_context *smu, >>>>> uint32_t param, >>>>> uint32_t *read_arg) >>>>> { >>>>> - struct amdgpu_device *adev = smu->adev; >>>>> - int ret = 0, index = 0; >>>>> + int res, index; >>>>> + u32 reg; >>>>> >>>>> if (smu->adev->in_pci_err_recovery) >>>>> return 0; >>>>> @@ -136,31 +251,20 @@ int >>> smu_cmn_send_smc_msg_with_param(struct >>>>> smu_context *smu, >>>>> return index == -EACCES ? 0 : index; >>>>> >>>>> mutex_lock(&smu->message_lock); >>>>> - ret = smu_cmn_send_msg_without_waiting(smu, (uint16_t)index, >>>>> param); >>>>> - if (ret) >>>>> - goto out; >>>>> - >>>>> - ret = smu_cmn_wait_for_response(smu); >>>>> - if (ret != 0x1) { >>>>> - if (ret == -ETIME) { >>>>> - dev_err(adev->dev, "message: %15s (%d) \tparam: >>>>> 0x%08x is timeout (no response)\n", >>>>> - smu_get_message_name(smu, msg), index, >>>>> param); >>>>> - } else { >>>>> - dev_err(adev->dev, "failed send message: %15s (%d) >>>>> \tparam: 0x%08x response %#x\n", >>>>> - smu_get_message_name(smu, msg), index, >>>>> param, >>>>> - ret); >>>>> - ret = -EIO; >>>>> - } >>>>> - goto out; >>>>> + reg = __smu_cmn_poll_stat(smu); >>>>> + if (reg == 0 || reg == 0xFC) { >>>>> + res = __smu_cmn_reg2errno(smu, reg); >>>>> + __smu_cmn_reg_print_error(smu, reg, index, param, msg); >>>>> + goto Out; >>>>> } >>>>> - >>>>> + __smu_cmn_send_msg(smu, (uint16_t) index, param); >>>>> + reg = __smu_cmn_poll_stat(smu); >>>>> + res = __smu_cmn_reg2errno(smu, reg); >>>>> if (read_arg) >>>>> smu_cmn_read_arg(smu, read_arg); >>>>> - >>>>> - ret = 0; /* 0 as driver return value */ >>>>> -out: >>>>> +Out: >>>>> mutex_unlock(&smu->message_lock); >>>>> - return ret; >>>>> + return res; >>>>> } >>>>> >>>>> int smu_cmn_send_smc_msg(struct smu_context *smu, diff --git >>>>> a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h >>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h >>>>> index 9add5f16ff562a..16993daa2ae042 100644 >>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h >>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h >>>>> @@ -27,7 +27,8 @@ >>>>> >>>>> #if defined(SWSMU_CODE_LAYER_L2) || >>> defined(SWSMU_CODE_LAYER_L3) >>>>> || defined(SWSMU_CODE_LAYER_L4) >>>>> int smu_cmn_send_msg_without_waiting(struct smu_context *smu, >>>>> - uint16_t msg, uint32_t param); >>>>> + uint16_t msg_index, >>>>> + uint32_t param); >>>>> int smu_cmn_send_smc_msg_with_param(struct smu_context *smu, >>>>> enum smu_message_type msg, >>>>> uint32_t param, >>>>> -- >>>>> 2.32.0 >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx@xxxxxxxxxxxxxxxxxxxxx >> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=04%7C01%7CLuben.Tuikov%40amd.com%7C0034eccb90234c6ba0d908d946db48a1%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637618729900432066%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=DYigx7bg6kAoZOeQKP2olVlwmrombaAnVpVNDXhqwzg%3D&reserved=0 _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx