Re: [PATCH] drm/amd/pm: Fix a bug communicating with the SMU

Alex Deucher <alexdeucher@xxxxxxxxx> · Wed, 14 Jul 2021 11:22:56 -0400

On Tue, Jul 13, 2021 at 10:01 PM Quan, Evan <Evan.Quan@xxxxxxx> wrote:
>
> [AMD Official Use Only]
>
>
>
> > -----Original Message-----
> > From: Tuikov, Luben <Luben.Tuikov@xxxxxxx>
> > Sent: Wednesday, July 14, 2021 1:36 AM
> > To: Quan, Evan <Evan.Quan@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Alimucaj, Andi
> > <Anduil.Alimucaj@xxxxxxx>; Baluja, Aaron <Aaron.Baluja@xxxxxxx>
> > Subject: Re: [PATCH] drm/amd/pm: Fix a bug communicating with the SMU
> >
> > On 2021-07-13 3:07 a.m., Quan, Evan wrote:
> > > [AMD Official Use Only]
> > >
> > >
> > >
> > >> -----Original Message-----
> > >> From: Tuikov, Luben <Luben.Tuikov@xxxxxxx>
> > >> Sent: Monday, July 12, 2021 11:31 PM
> > >> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > >> Cc: Tuikov, Luben <Luben.Tuikov@xxxxxxx>; Deucher, Alexander
> > >> <Alexander.Deucher@xxxxxxx>; Quan, Evan <Evan.Quan@xxxxxxx>
> > >> Subject: [PATCH] drm/amd/pm: Fix a bug communicating with the SMU
> > >>
> > >> This fixes a bug which if we probe a non-existing I2C device, and the
> > >> SMU returns 0xFF,
> > > [Quan, Evan] Do we have to probe I2C device via issuing commands to SMU
> > and check existence via SMU response?
> >
> > Yes, yes we do.
> >
> > > Is there other more friendly way?
> >
> > No, there isn't.
> >
> > > >from then on
> > >> we can never communicate with the SMU, because the code before this
> > >> patch reads and interprets 0xFF as a terminal error, and thus we
> > >> never write 0 into register 90 to clear the status (and subsequently
> > >> send a new command to the SMU.)
> > >>
> > >> It is not an error that the SMU returns status 0xFF. This means that
> > >> the SMU executed the last command successfully (execution status),
> > >> but the command result is an error of some sort (execution result),
> > >> depending on what the command was.
> > >>
> > >> When doing a status check of the SMU, before we send a new command,
> > >> the only status which precludes us from sending a new command is
> > >> 0--the SMU hasn't finished executing a previous command, and
> > >> 0xFC--the SMU is busy.
> > >>
> > >> This bug was seen as the following line in the kernel log,
> > >>
> > >> amdgpu: Msg issuing pre-check failed(0xff) and SMU may be not in the
> > >> right state!
> > > [Quan, Evan] This was designed to prevent failure scenario from ruin.
> > > Via this, we can prevent those SMU command/response related registers
> > overwritten.
> > > Then PMFW team can known which command invoked the first error.
> >
> > Sorry, this is not correct.
> >
> > The interface cannot block valid access to the SMU firmware, just because a
> > command executed successfully, but with a failed result, i.e. set a clock
> > frequency to such-and-such frequency was executed successfully by the
> > SMU, but the frequency wasn't able to be set as required--the SMU IS NOT
> > BLOCKED, but can execute more commands.
> [Quan, Evan]
> 1. First of all, if the response from SMU is not 1, it means SMU must detected something wrong.
> We(driver) should properly handle that. I do not think it's a good idea to silently ignore that and proceed more commands.
> 2. Please be noticed that many commands(SMU messages) have dependence with each other. It means even if the further command
> can be executed "successfully", it may be not executed in the expected way.
> >
> > If the PMFW team wants to know which command invoked "the first error",
> > they can check this explicitly, they can call smu_cmn_wait_for_response(),
> > just like the reset code does--see the reset code.
> [Quan, Evan] Per my knowledge gained during co-work with PMFW team, they expect no further driver-smu interaction(driver stops issuing command)
>  when something went wrong. As they highly rely on SMU internal statistics for their debugging and further interaction may ruin them.
> >
> > The way commit fcb1fe9c9e0031 modified the code, it blocks access to the
> > SMU for various other users of the SMU, not least of which is the I2C.
> [Quan, Evan] I'm wondering can we just list the I2C case as an exception. I means for the SMU command related with I2C we always assume the response is 1.
> For other commands, we just leave them as they are.

Maybe we need to just bubble this up to the callers and let each one
handle it as appropriate.  I don't think just throwing up our hands on
any error is the right thing to do.  The i2c case is a good example.
In other cases, we could probably just retry the transaction or just
ignore the error.

Alex

>
> BR
> Evan
> >
> > Regards,
> > Luben
> >
> > >
> > > BR
> > > Evan
> > >> when subsequent SMU commands, not necessarily related to I2C, were
> > >> sent to the SMU.
> > >>
> > >> This patch fixes this bug.
> > >>
> > >> Cc: Alex Deucher <Alexander.Deucher@xxxxxxx>
> > >> Cc: Evan Quan <evan.quan@xxxxxxx>
> > >> Fixes: fcb1fe9c9e0031 ("drm/amd/powerplay: pre-check the SMU state
> > >> before issuing message")
> > >> Signed-off-by: Luben Tuikov <luben.tuikov@xxxxxxx>
> > >> ---
> > >>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 196
> > >> +++++++++++++++++++------
> > >>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h |   3 +-
> > >>  2 files changed, 152 insertions(+), 47 deletions(-)
> > >>
> > >> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > >> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > >> index c902fdf322c1be..775eb50a2e49a6 100644
> > >> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > >> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> > >> @@ -55,7 +55,7 @@
> > >>
> > >>  #undef __SMU_DUMMY_MAP
> > >>  #define __SMU_DUMMY_MAP(type)     #type
> > >> -static const char* __smu_message_names[] = {
> > >> +static const char * const __smu_message_names[] = {
> > >>    SMU_MESSAGE_TYPES
> > >>  };
> > >>
> > >> @@ -76,46 +76,161 @@ static void smu_cmn_read_arg(struct
> > smu_context
> > >> *smu,
> > >>    *arg = RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_82);  }
> > >>
> > >> -int smu_cmn_wait_for_response(struct smu_context *smu)
> > >> +/**
> > >> + * __smu_cmn_poll_stat -- poll for a status from the SMU
> > >> + * smu: a pointer to SMU context
> > >> + *
> > >> + * Returns the status of the SMU, which could be,
> > >> + * 0, the SMU is busy with your previous command;
> > >> + * 1,    execution status: success, execution result: success;
> > >> + * 0xFF, execution status: success, execution result: failure;
> > >> + * 0xFE, unknown command;
> > >> + * 0xFD, valid command, but bad (command) prerequisites;
> > >> + * 0xFC, the command was rejected as the SMU is busy;
> > >> + * 0xFB, "SMC_Result_DebugDataDumpEnd".
> > >> + */
> > >> +static u32 __smu_cmn_poll_stat(struct smu_context *smu)
> > >>  {
> > >>    struct amdgpu_device *adev = smu->adev;
> > >> -  uint32_t cur_value, i, timeout = adev->usec_timeout * 20;
> > >> +  int timeout = adev->usec_timeout * 20;
> > >> +  u32 reg;
> > >>
> > >> -  for (i = 0; i < timeout; i++) {
> > >> -          cur_value = RREG32_SOC15(MP1, 0,
> > >> mmMP1_SMN_C2PMSG_90);
> > >> -          if ((cur_value & MP1_C2PMSG_90__CONTENT_MASK) != 0)
> > >> -                  return cur_value;
> > >> +  for ( ; timeout > 0; timeout--) {
> > >> +          reg = RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90);
> > >> +          if ((reg & MP1_C2PMSG_90__CONTENT_MASK) != 0)
> > >> +                  break;
> > >>
> > >>            udelay(1);
> > >>    }
> > >>
> > >> -  /* timeout means wrong logic */
> > >> -  if (i == timeout)
> > >> -          return -ETIME;
> > >> -
> > >> -  return RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90);
> > >> +  return reg;
> > >>  }
> > >>
> > >> -int smu_cmn_send_msg_without_waiting(struct smu_context *smu,
> > >> -                               uint16_t msg, uint32_t param)
> > >> +static void __smu_cmn_reg_print_error(struct smu_context *smu,
> > >> +                                u32 reg_c2pmsg_90,
> > >> +                                int msg_index,
> > >> +                                u32 param,
> > >> +                                enum smu_message_type msg)
> > >>  {
> > >>    struct amdgpu_device *adev = smu->adev;
> > >> -  int ret;
> > >> +  const char *message = smu_get_message_name(smu, msg);
> > >>
> > >> -  ret = smu_cmn_wait_for_response(smu);
> > >> -  if (ret != 0x1) {
> > >> -          dev_err(adev->dev, "Msg issuing pre-check failed(0x%x) and
> > >> "
> > >> -                 "SMU may be not in the right state!\n", ret);
> > >> -          if (ret != -ETIME)
> > >> -                  ret = -EIO;
> > >> -          return ret;
> > >> +  switch (reg_c2pmsg_90) {
> > >> +  case 0:
> > >> +          dev_err_ratelimited(adev->dev,
> > >> +                              "SMU: I'm not done with your previous
> > >> command!");
> > >> +          break;
> > >> +  case 1:
> > >> +          /* The SMU executed the command. It completed with a
> > >> +           * successful result.
> > >> +           */
> > >> +          break;
> > >> +  case 0xFF:
> > >> +          /* The SMU executed the command. It completed with a
> > >> +           * unsuccessful result.
> > >> +           */
> > >> +          break;
> > >> +  case 0xFE:
> > >> +          dev_err_ratelimited(adev->dev,
> > >> +                              "SMU: unknown command: index:%d
> > >> param:0x%08X message:%s",
> > >> +                              msg_index, param, message);
> > >> +          break;
> > >> +  case 0xFD:
> > >> +          dev_err_ratelimited(adev->dev,
> > >> +                              "SMU: valid command, bad prerequisites:
> > >> index:%d param:0x%08X message:%s",
> > >> +                              msg_index, param, message);
> > >> +          break;
> > >> +  case 0xFC:
> > >> +          dev_err_ratelimited(adev->dev,
> > >> +                              "SMU: I'm very busy for your command:
> > >> index:%d param:0x%08X message:%s",
> > >> +                              msg_index, param, message);
> > >> +          break;
> > >> +  case 0xFB:
> > >> +          dev_err_ratelimited(adev->dev,
> > >> +                              "SMU: I'm debugging!");
> > >> +          break;
> > >> +  default:
> > >> +          dev_err_ratelimited(adev->dev,
> > >> +                              "SMU: response:0x%08X for index:%d
> > >> param:0x%08X message:%s?",
> > >> +                              reg_c2pmsg_90, msg_index, param,
> > >> message);
> > >> +          break;
> > >> +  }
> > >> +}
> > >> +
> > >> +static int __smu_cmn_reg2errno(struct smu_context *smu, u32
> > >> reg_c2pmsg_90)
> > >> +{
> > >> +  int res;
> > >> +
> > >> +  switch (reg_c2pmsg_90) {
> > >> +  case 0:
> > >> +          res = -ETIME;
> > >> +          break;
> > >> +  case 1:
> > >> +          res = 0;
> > >> +          break;
> > >> +  case 0xFF:
> > >> +          res = -EIO;
> > >> +          break;
> > >> +  case 0xFE:
> > >> +          res = -EOPNOTSUPP;
> > >> +          break;
> > >> +  case 0xFD:
> > >> +          res = -EIO;
> > >> +          break;
> > >> +  case 0xFC:
> > >> +          res = -EBUSY;
> > >> +          break;
> > >> +  case 0xFB:
> > >> +          res = -EIO;
> > >> +          break;
> > >> +  default:
> > >> +          res = -EIO;
> > >> +          break;
> > >>    }
> > >>
> > >> +  return res;
> > >> +}
> > >> +
> > >> +static void __smu_cmn_send_msg(struct smu_context *smu,
> > >> +                         u16 msg,
> > >> +                         u32 param)
> > >> +{
> > >> +  struct amdgpu_device *adev = smu->adev;
> > >> +
> > >>    WREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90, 0);
> > >>    WREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_82, param);
> > >>    WREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_66, msg);
> > >> +}
> > >>
> > >> -  return 0;
> > >> +int smu_cmn_send_msg_without_waiting(struct smu_context *smu,
> > >> +                               uint16_t msg_index,
> > >> +                               uint32_t param)
> > >> +{
> > >> +  u32 reg;
> > >> +  int res;
> > >> +
> > >> +  if (smu->adev->in_pci_err_recovery)
> > >> +          return 0;
> > >> +
> > >> +  mutex_lock(&smu->message_lock);
> > >> +  reg = __smu_cmn_poll_stat(smu);
> > >> +  if (reg == 0 || reg == 0xFC) {
> > >> +          res = __smu_cmn_reg2errno(smu, reg);
> > >> +          goto Out;
> > >> +  }
> > >> +  __smu_cmn_send_msg(smu, msg_index, param);
> > >> +  res = 0;
> > >> +Out:
> > >> +  mutex_unlock(&smu->message_lock);
> > >> +  return res;
> > >> +}
> > >> +
> > >> +int smu_cmn_wait_for_response(struct smu_context *smu) {
> > >> +  u32 reg;
> > >> +
> > >> +  reg = __smu_cmn_poll_stat(smu);
> > >> +  return __smu_cmn_reg2errno(smu, reg);
> > >>  }
> > >>
> > >>  int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
> > @@
> > >> -123,8 +238,8 @@ int smu_cmn_send_smc_msg_with_param(struct
> > >> smu_context *smu,
> > >>                                uint32_t param,
> > >>                                uint32_t *read_arg)
> > >>  {
> > >> -  struct amdgpu_device *adev = smu->adev;
> > >> -  int ret = 0, index = 0;
> > >> +  int res, index;
> > >> +  u32 reg;
> > >>
> > >>    if (smu->adev->in_pci_err_recovery)
> > >>            return 0;
> > >> @@ -136,31 +251,20 @@ int
> > smu_cmn_send_smc_msg_with_param(struct
> > >> smu_context *smu,
> > >>            return index == -EACCES ? 0 : index;
> > >>
> > >>    mutex_lock(&smu->message_lock);
> > >> -  ret = smu_cmn_send_msg_without_waiting(smu, (uint16_t)index,
> > >> param);
> > >> -  if (ret)
> > >> -          goto out;
> > >> -
> > >> -  ret = smu_cmn_wait_for_response(smu);
> > >> -  if (ret != 0x1) {
> > >> -          if (ret == -ETIME) {
> > >> -                  dev_err(adev->dev, "message: %15s (%d) \tparam:
> > >> 0x%08x is timeout (no response)\n",
> > >> -                          smu_get_message_name(smu, msg), index,
> > >> param);
> > >> -          } else {
> > >> -                  dev_err(adev->dev, "failed send message: %15s (%d)
> > >> \tparam: 0x%08x response %#x\n",
> > >> -                          smu_get_message_name(smu, msg), index,
> > >> param,
> > >> -                          ret);
> > >> -                  ret = -EIO;
> > >> -          }
> > >> -          goto out;
> > >> +  reg = __smu_cmn_poll_stat(smu);
> > >> +  if (reg == 0 || reg == 0xFC) {
> > >> +          res = __smu_cmn_reg2errno(smu, reg);
> > >> +          __smu_cmn_reg_print_error(smu, reg, index, param, msg);
> > >> +          goto Out;
> > >>    }
> > >> -
> > >> +  __smu_cmn_send_msg(smu, (uint16_t) index, param);
> > >> +  reg = __smu_cmn_poll_stat(smu);
> > >> +  res = __smu_cmn_reg2errno(smu, reg);
> > >>    if (read_arg)
> > >>            smu_cmn_read_arg(smu, read_arg);
> > >> -
> > >> -  ret = 0; /* 0 as driver return value */
> > >> -out:
> > >> +Out:
> > >>    mutex_unlock(&smu->message_lock);
> > >> -  return ret;
> > >> +  return res;
> > >>  }
> > >>
> > >>  int smu_cmn_send_smc_msg(struct smu_context *smu, diff --git
> > >> a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h
> > >> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h
> > >> index 9add5f16ff562a..16993daa2ae042 100644
> > >> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h
> > >> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h
> > >> @@ -27,7 +27,8 @@
> > >>
> > >>  #if defined(SWSMU_CODE_LAYER_L2) ||
> > defined(SWSMU_CODE_LAYER_L3)
> > >> || defined(SWSMU_CODE_LAYER_L4)
> > >>  int smu_cmn_send_msg_without_waiting(struct smu_context *smu,
> > >> -                               uint16_t msg, uint32_t param);
> > >> +                               uint16_t msg_index,
> > >> +                               uint32_t param);
> > >>  int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
> > >>                                enum smu_message_type msg,
> > >>                                uint32_t param,
> > >> --
> > >> 2.32.0
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx