Re: [PATCH] drm/amd/pm: Fix a bug communicating with the SMU

Luben Tuikov <luben.tuikov@xxxxxxx> · Wed, 14 Jul 2021 12:54:05 -0400

On 2021-07-14 11:22 a.m., Alex Deucher wrote:
> On Tue, Jul 13, 2021 at 10:01 PM Quan, Evan <Evan.Quan@xxxxxxx> wrote:
>> [AMD Official Use Only]
>>
>>
>>
>>> -----Original Message-----
>>> From: Tuikov, Luben <Luben.Tuikov@xxxxxxx>
>>> Sent: Wednesday, July 14, 2021 1:36 AM
>>> To: Quan, Evan <Evan.Quan@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Alimucaj, Andi
>>> <Anduil.Alimucaj@xxxxxxx>; Baluja, Aaron <Aaron.Baluja@xxxxxxx>
>>> Subject: Re: [PATCH] drm/amd/pm: Fix a bug communicating with the SMU
>>>
>>> On 2021-07-13 3:07 a.m., Quan, Evan wrote:
>>>> [AMD Official Use Only]
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Tuikov, Luben <Luben.Tuikov@xxxxxxx>
>>>>> Sent: Monday, July 12, 2021 11:31 PM
>>>>> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>> Cc: Tuikov, Luben <Luben.Tuikov@xxxxxxx>; Deucher, Alexander
>>>>> <Alexander.Deucher@xxxxxxx>; Quan, Evan <Evan.Quan@xxxxxxx>
>>>>> Subject: [PATCH] drm/amd/pm: Fix a bug communicating with the SMU
>>>>>
>>>>> This fixes a bug which if we probe a non-existing I2C device, and the
>>>>> SMU returns 0xFF,
>>>> [Quan, Evan] Do we have to probe I2C device via issuing commands to SMU
>>> and check existence via SMU response?
>>>
>>> Yes, yes we do.
>>>
>>>> Is there other more friendly way?
>>> No, there isn't.
>>>
>>>> >from then on
>>>>> we can never communicate with the SMU, because the code before this
>>>>> patch reads and interprets 0xFF as a terminal error, and thus we
>>>>> never write 0 into register 90 to clear the status (and subsequently
>>>>> send a new command to the SMU.)
>>>>>
>>>>> It is not an error that the SMU returns status 0xFF. This means that
>>>>> the SMU executed the last command successfully (execution status),
>>>>> but the command result is an error of some sort (execution result),
>>>>> depending on what the command was.
>>>>>
>>>>> When doing a status check of the SMU, before we send a new command,
>>>>> the only status which precludes us from sending a new command is
>>>>> 0--the SMU hasn't finished executing a previous command, and
>>>>> 0xFC--the SMU is busy.
>>>>>
>>>>> This bug was seen as the following line in the kernel log,
>>>>>
>>>>> amdgpu: Msg issuing pre-check failed(0xff) and SMU may be not in the
>>>>> right state!
>>>> [Quan, Evan] This was designed to prevent failure scenario from ruin.
>>>> Via this, we can prevent those SMU command/response related registers
>>> overwritten.
>>>> Then PMFW team can known which command invoked the first error.
>>> Sorry, this is not correct.
>>>
>>> The interface cannot block valid access to the SMU firmware, just because a
>>> command executed successfully, but with a failed result, i.e. set a clock
>>> frequency to such-and-such frequency was executed successfully by the
>>> SMU, but the frequency wasn't able to be set as required--the SMU IS NOT
>>> BLOCKED, but can execute more commands.
>> [Quan, Evan]
>> 1. First of all, if the response from SMU is not 1, it means SMU must detected something wrong.
>> We(driver) should properly handle that. I do not think it's a good idea to silently ignore that and proceed more commands.
>> 2. Please be noticed that many commands(SMU messages) have dependence with each other. It means even if the further command
>> can be executed "successfully", it may be not executed in the expected way.
>>> If the PMFW team wants to know which command invoked "the first error",
>>> they can check this explicitly, they can call smu_cmn_wait_for_response(),
>>> just like the reset code does--see the reset code.
>> [Quan, Evan] Per my knowledge gained during co-work with PMFW team, they expect no further driver-smu interaction(driver stops issuing command)
>>  when something went wrong. As they highly rely on SMU internal statistics for their debugging and further interaction may ruin them.
>>> The way commit fcb1fe9c9e0031 modified the code, it blocks access to the
>>> SMU for various other users of the SMU, not least of which is the I2C.
>> [Quan, Evan] I'm wondering can we just list the I2C case as an exception. I means for the SMU command related with I2C we always assume the response is 1.
>> For other commands, we just leave them as they are.
> Maybe we need to just bubble this up to the callers and let each one
> handle it as appropriate.  I don't think just throwing up our hands on
> any error is the right thing to do.  The i2c case is a good example.
> In other cases, we could probably just retry the transaction or just
> ignore the error.

My thoughts exactly. I'll generate a v3 of the patch to address the issues brought here and elsewhere.

Regards,
Luben

>
> Alex
>
>
>> BR
>> Evan
>>> Regards,
>>> Luben
>>>
>>>> BR
>>>> Evan
>>>>> when subsequent SMU commands, not necessarily related to I2C, were
>>>>> sent to the SMU.
>>>>>
>>>>> This patch fixes this bug.
>>>>>
>>>>> Cc: Alex Deucher <Alexander.Deucher@xxxxxxx>
>>>>> Cc: Evan Quan <evan.quan@xxxxxxx>
>>>>> Fixes: fcb1fe9c9e0031 ("drm/amd/powerplay: pre-check the SMU state
>>>>> before issuing message")
>>>>> Signed-off-by: Luben Tuikov <luben.tuikov@xxxxxxx>
>>>>> ---
>>>>>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 196
>>>>> +++++++++++++++++++------
>>>>>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h |   3 +-
>>>>>  2 files changed, 152 insertions(+), 47 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>> index c902fdf322c1be..775eb50a2e49a6 100644
>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>> @@ -55,7 +55,7 @@
>>>>>
>>>>>  #undef __SMU_DUMMY_MAP
>>>>>  #define __SMU_DUMMY_MAP(type)     #type
>>>>> -static const char* __smu_message_names[] = {
>>>>> +static const char * const __smu_message_names[] = {
>>>>>    SMU_MESSAGE_TYPES
>>>>>  };
>>>>>
>>>>> @@ -76,46 +76,161 @@ static void smu_cmn_read_arg(struct
>>> smu_context
>>>>> *smu,
>>>>>    *arg = RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_82);  }
>>>>>
>>>>> -int smu_cmn_wait_for_response(struct smu_context *smu)
>>>>> +/**
>>>>> + * __smu_cmn_poll_stat -- poll for a status from the SMU
>>>>> + * smu: a pointer to SMU context
>>>>> + *
>>>>> + * Returns the status of the SMU, which could be,
>>>>> + * 0, the SMU is busy with your previous command;
>>>>> + * 1,    execution status: success, execution result: success;
>>>>> + * 0xFF, execution status: success, execution result: failure;
>>>>> + * 0xFE, unknown command;
>>>>> + * 0xFD, valid command, but bad (command) prerequisites;
>>>>> + * 0xFC, the command was rejected as the SMU is busy;
>>>>> + * 0xFB, "SMC_Result_DebugDataDumpEnd".
>>>>> + */
>>>>> +static u32 __smu_cmn_poll_stat(struct smu_context *smu)
>>>>>  {
>>>>>    struct amdgpu_device *adev = smu->adev;
>>>>> -  uint32_t cur_value, i, timeout = adev->usec_timeout * 20;
>>>>> +  int timeout = adev->usec_timeout * 20;
>>>>> +  u32 reg;
>>>>>
>>>>> -  for (i = 0; i < timeout; i++) {
>>>>> -          cur_value = RREG32_SOC15(MP1, 0,
>>>>> mmMP1_SMN_C2PMSG_90);
>>>>> -          if ((cur_value & MP1_C2PMSG_90__CONTENT_MASK) != 0)
>>>>> -                  return cur_value;
>>>>> +  for ( ; timeout > 0; timeout--) {
>>>>> +          reg = RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90);
>>>>> +          if ((reg & MP1_C2PMSG_90__CONTENT_MASK) != 0)
>>>>> +                  break;
>>>>>
>>>>>            udelay(1);
>>>>>    }
>>>>>
>>>>> -  /* timeout means wrong logic */
>>>>> -  if (i == timeout)
>>>>> -          return -ETIME;
>>>>> -
>>>>> -  return RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90);
>>>>> +  return reg;
>>>>>  }
>>>>>
>>>>> -int smu_cmn_send_msg_without_waiting(struct smu_context *smu,
>>>>> -                               uint16_t msg, uint32_t param)
>>>>> +static void __smu_cmn_reg_print_error(struct smu_context *smu,
>>>>> +                                u32 reg_c2pmsg_90,
>>>>> +                                int msg_index,
>>>>> +                                u32 param,
>>>>> +                                enum smu_message_type msg)
>>>>>  {
>>>>>    struct amdgpu_device *adev = smu->adev;
>>>>> -  int ret;
>>>>> +  const char *message = smu_get_message_name(smu, msg);
>>>>>
>>>>> -  ret = smu_cmn_wait_for_response(smu);
>>>>> -  if (ret != 0x1) {
>>>>> -          dev_err(adev->dev, "Msg issuing pre-check failed(0x%x) and
>>>>> "
>>>>> -                 "SMU may be not in the right state!\n", ret);
>>>>> -          if (ret != -ETIME)
>>>>> -                  ret = -EIO;
>>>>> -          return ret;
>>>>> +  switch (reg_c2pmsg_90) {
>>>>> +  case 0:
>>>>> +          dev_err_ratelimited(adev->dev,
>>>>> +                              "SMU: I'm not done with your previous
>>>>> command!");
>>>>> +          break;
>>>>> +  case 1:
>>>>> +          /* The SMU executed the command. It completed with a
>>>>> +           * successful result.
>>>>> +           */
>>>>> +          break;
>>>>> +  case 0xFF:
>>>>> +          /* The SMU executed the command. It completed with a
>>>>> +           * unsuccessful result.
>>>>> +           */
>>>>> +          break;
>>>>> +  case 0xFE:
>>>>> +          dev_err_ratelimited(adev->dev,
>>>>> +                              "SMU: unknown command: index:%d
>>>>> param:0x%08X message:%s",
>>>>> +                              msg_index, param, message);
>>>>> +          break;
>>>>> +  case 0xFD:
>>>>> +          dev_err_ratelimited(adev->dev,
>>>>> +                              "SMU: valid command, bad prerequisites:
>>>>> index:%d param:0x%08X message:%s",
>>>>> +                              msg_index, param, message);
>>>>> +          break;
>>>>> +  case 0xFC:
>>>>> +          dev_err_ratelimited(adev->dev,
>>>>> +                              "SMU: I'm very busy for your command:
>>>>> index:%d param:0x%08X message:%s",
>>>>> +                              msg_index, param, message);
>>>>> +          break;
>>>>> +  case 0xFB:
>>>>> +          dev_err_ratelimited(adev->dev,
>>>>> +                              "SMU: I'm debugging!");
>>>>> +          break;
>>>>> +  default:
>>>>> +          dev_err_ratelimited(adev->dev,
>>>>> +                              "SMU: response:0x%08X for index:%d
>>>>> param:0x%08X message:%s?",
>>>>> +                              reg_c2pmsg_90, msg_index, param,
>>>>> message);
>>>>> +          break;
>>>>> +  }
>>>>> +}
>>>>> +
>>>>> +static int __smu_cmn_reg2errno(struct smu_context *smu, u32
>>>>> reg_c2pmsg_90)
>>>>> +{
>>>>> +  int res;
>>>>> +
>>>>> +  switch (reg_c2pmsg_90) {
>>>>> +  case 0:
>>>>> +          res = -ETIME;
>>>>> +          break;
>>>>> +  case 1:
>>>>> +          res = 0;
>>>>> +          break;
>>>>> +  case 0xFF:
>>>>> +          res = -EIO;
>>>>> +          break;
>>>>> +  case 0xFE:
>>>>> +          res = -EOPNOTSUPP;
>>>>> +          break;
>>>>> +  case 0xFD:
>>>>> +          res = -EIO;
>>>>> +          break;
>>>>> +  case 0xFC:
>>>>> +          res = -EBUSY;
>>>>> +          break;
>>>>> +  case 0xFB:
>>>>> +          res = -EIO;
>>>>> +          break;
>>>>> +  default:
>>>>> +          res = -EIO;
>>>>> +          break;
>>>>>    }
>>>>>
>>>>> +  return res;
>>>>> +}
>>>>> +
>>>>> +static void __smu_cmn_send_msg(struct smu_context *smu,
>>>>> +                         u16 msg,
>>>>> +                         u32 param)
>>>>> +{
>>>>> +  struct amdgpu_device *adev = smu->adev;
>>>>> +
>>>>>    WREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90, 0);
>>>>>    WREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_82, param);
>>>>>    WREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_66, msg);
>>>>> +}
>>>>>
>>>>> -  return 0;
>>>>> +int smu_cmn_send_msg_without_waiting(struct smu_context *smu,
>>>>> +                               uint16_t msg_index,
>>>>> +                               uint32_t param)
>>>>> +{
>>>>> +  u32 reg;
>>>>> +  int res;
>>>>> +
>>>>> +  if (smu->adev->in_pci_err_recovery)
>>>>> +          return 0;
>>>>> +
>>>>> +  mutex_lock(&smu->message_lock);
>>>>> +  reg = __smu_cmn_poll_stat(smu);
>>>>> +  if (reg == 0 || reg == 0xFC) {
>>>>> +          res = __smu_cmn_reg2errno(smu, reg);
>>>>> +          goto Out;
>>>>> +  }
>>>>> +  __smu_cmn_send_msg(smu, msg_index, param);
>>>>> +  res = 0;
>>>>> +Out:
>>>>> +  mutex_unlock(&smu->message_lock);
>>>>> +  return res;
>>>>> +}
>>>>> +
>>>>> +int smu_cmn_wait_for_response(struct smu_context *smu) {
>>>>> +  u32 reg;
>>>>> +
>>>>> +  reg = __smu_cmn_poll_stat(smu);
>>>>> +  return __smu_cmn_reg2errno(smu, reg);
>>>>>  }
>>>>>
>>>>>  int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
>>> @@
>>>>> -123,8 +238,8 @@ int smu_cmn_send_smc_msg_with_param(struct
>>>>> smu_context *smu,
>>>>>                                uint32_t param,
>>>>>                                uint32_t *read_arg)
>>>>>  {
>>>>> -  struct amdgpu_device *adev = smu->adev;
>>>>> -  int ret = 0, index = 0;
>>>>> +  int res, index;
>>>>> +  u32 reg;
>>>>>
>>>>>    if (smu->adev->in_pci_err_recovery)
>>>>>            return 0;
>>>>> @@ -136,31 +251,20 @@ int
>>> smu_cmn_send_smc_msg_with_param(struct
>>>>> smu_context *smu,
>>>>>            return index == -EACCES ? 0 : index;
>>>>>
>>>>>    mutex_lock(&smu->message_lock);
>>>>> -  ret = smu_cmn_send_msg_without_waiting(smu, (uint16_t)index,
>>>>> param);
>>>>> -  if (ret)
>>>>> -          goto out;
>>>>> -
>>>>> -  ret = smu_cmn_wait_for_response(smu);
>>>>> -  if (ret != 0x1) {
>>>>> -          if (ret == -ETIME) {
>>>>> -                  dev_err(adev->dev, "message: %15s (%d) \tparam:
>>>>> 0x%08x is timeout (no response)\n",
>>>>> -                          smu_get_message_name(smu, msg), index,
>>>>> param);
>>>>> -          } else {
>>>>> -                  dev_err(adev->dev, "failed send message: %15s (%d)
>>>>> \tparam: 0x%08x response %#x\n",
>>>>> -                          smu_get_message_name(smu, msg), index,
>>>>> param,
>>>>> -                          ret);
>>>>> -                  ret = -EIO;
>>>>> -          }
>>>>> -          goto out;
>>>>> +  reg = __smu_cmn_poll_stat(smu);
>>>>> +  if (reg == 0 || reg == 0xFC) {
>>>>> +          res = __smu_cmn_reg2errno(smu, reg);
>>>>> +          __smu_cmn_reg_print_error(smu, reg, index, param, msg);
>>>>> +          goto Out;
>>>>>    }
>>>>> -
>>>>> +  __smu_cmn_send_msg(smu, (uint16_t) index, param);
>>>>> +  reg = __smu_cmn_poll_stat(smu);
>>>>> +  res = __smu_cmn_reg2errno(smu, reg);
>>>>>    if (read_arg)
>>>>>            smu_cmn_read_arg(smu, read_arg);
>>>>> -
>>>>> -  ret = 0; /* 0 as driver return value */
>>>>> -out:
>>>>> +Out:
>>>>>    mutex_unlock(&smu->message_lock);
>>>>> -  return ret;
>>>>> +  return res;
>>>>>  }
>>>>>
>>>>>  int smu_cmn_send_smc_msg(struct smu_context *smu, diff --git
>>>>> a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h
>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h
>>>>> index 9add5f16ff562a..16993daa2ae042 100644
>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h
>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h
>>>>> @@ -27,7 +27,8 @@
>>>>>
>>>>>  #if defined(SWSMU_CODE_LAYER_L2) ||
>>> defined(SWSMU_CODE_LAYER_L3)
>>>>> || defined(SWSMU_CODE_LAYER_L4)
>>>>>  int smu_cmn_send_msg_without_waiting(struct smu_context *smu,
>>>>> -                               uint16_t msg, uint32_t param);
>>>>> +                               uint16_t msg_index,
>>>>> +                               uint32_t param);
>>>>>  int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
>>>>>                                enum smu_message_type msg,
>>>>>                                uint32_t param,
>>>>> --
>>>>> 2.32.0
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=04%7C01%7CLuben.Tuikov%40amd.com%7C0034eccb90234c6ba0d908d946db48a1%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637618729900432066%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=DYigx7bg6kAoZOeQKP2olVlwmrombaAnVpVNDXhqwzg%3D&amp;reserved=0

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx