Re: [PATCH] drm/amd/pm: Fix a bug communicating with the SMU (v4)

Alex Deucher <alexdeucher@xxxxxxxxx> · Mon, 19 Jul 2021 16:51:41 -0400

On Mon, Jul 19, 2021 at 12:16 PM Luben Tuikov <luben.tuikov@xxxxxxx> wrote:
>
> This fixes a bug which if we probe a non-existing
> I2C device, and the SMU returns 0xFF, from then on
> we can never communicate with the SMU, because the
> code before this patch reads and interprets 0xFF
> as a terminal error, and thus we never write 0
> into register 90 to clear the status (and
> subsequently send a new command to the SMU.)
>
> It is not an error that the SMU returns status
> 0xFF. This means that the SMU executed the last
> command successfully (execution status), but the
> command result is an error of some sort (execution
> result), depending on what the command was.
>
> When doing a status check of the SMU, before we
> send a new command, the only status which
> precludes us from sending a new command is 0--the
> SMU hasn't finished executing a previous command,
> and 0xFC--the SMU is busy.
>
> This bug was seen as the following line in the
> kernel log,
>
> amdgpu: Msg issuing pre-check failed(0xff) and SMU may be not in the right state!
>
> when subsequent SMU commands, not necessarily
> related to I2C, were sent to the SMU.
>
> This patch fixes this bug.
>
> v2: Add a comment to the description of
> __smu_cmn_poll_stat() to explain why we're NOT
> defining the SMU FW return codes as macros, but
> are instead hard-coding them. Such a change, can
> be followed up by a subsequent patch.
>
> v3: The changes are,
> a) Add comments to break labels in
>    __smu_cmn_reg2errno().
>
> b) When an unknown/unspecified/undefined result is
>    returned back from the SMU, map that to
>    -EREMOTEIO, to distinguish failure at the SMU
>    FW.
>
> c) Add kernel-doc to
>    smu_cmn_send_msg_without_waiting(),
>    smu_cmn_wait_for_response(),
>    smu_cmn_send_smc_msg_with_param().
>
> d) In smu_cmn_send_smc_msg_with_param(), since we
>    wait for completion of the command, if the
>    result of the completion is
>    undefined/unknown/unspecified, we print that to
>    the kernel log.
>
> v4: a) Add macros as requested, though redundant, to
>     be removed when SMU consolidates for all
>     ASICs--see comment in code.
>     b) Get out if the SMU code is unknown.
>
> Cc: Alex Deucher <Alexander.Deucher@xxxxxxx>
> Cc: Evan Quan <evan.quan@xxxxxxx>
> Cc: Lijo Lazar <Lijo.Lazar@xxxxxxx>
> Fixes: fcb1fe9c9e0031 ("drm/amd/powerplay: pre-check the SMU state before issuing message")
> Signed-off-by: Luben Tuikov <luben.tuikov@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 294 +++++++++++++++++++++----
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h |   3 +-
>  2 files changed, 250 insertions(+), 47 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> index c902fdf322c1be..9b30737bd541d6 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> @@ -55,7 +55,7 @@
>
>  #undef __SMU_DUMMY_MAP
>  #define __SMU_DUMMY_MAP(type)  #type
> -static const char* __smu_message_names[] = {
> +static const char * const __smu_message_names[] = {
>         SMU_MESSAGE_TYPES
>  };
>
> @@ -76,55 +76,264 @@ static void smu_cmn_read_arg(struct smu_context *smu,
>         *arg = RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_82);
>  }
>
> -int smu_cmn_wait_for_response(struct smu_context *smu)
> +/* Redefine the SMU error codes here.
> + *
> + * Note that these definitions are redundant and should be removed
> + * when the SMU has exported a unified header file containing these
> + * macros, which header file we can just include and use the SMU's
> + * macros. At the moment, these error codes are defined by the SMU
> + * per-ASIC unfortunately, yet we're a one driver for all ASICs.
> + */
> +
> +#define SMU_BUSY           0

This is really a response per se, it's just that the SMU hasn't
updated the response register yet.
Maybe SMU_RESP_NONE

> +#define SMU_GOOD           1
> +#define SMU_CMD_FAIL       0xFF
> +#define SMU_CMD_UNKNOWN    0xFE
> +#define SMU_CMD_BAD_PREREQ 0xFD
> +#define SMU_BUSY_OTHER     0xFC
> +#define SMU_DEBUG_END      0xFB

I'd change the macros to include RESP for consistency E.g.,
#define SMU_RESP_OK           1
#define SMU_RESP_CMD_FAIL       0xFF
etc.

> +
> +/**
> + * __smu_cmn_poll_stat -- poll for a status from the SMU
> + * smu: a pointer to SMU context
> + *
> + * Returns the status of the SMU, which could be,
> + *    0, the SMU is busy with your previous command;
> + *    1, execution status: success, execution result: success;
> + * 0xFF, execution status: success, execution result: failure;
> + * 0xFE, unknown command;
> + * 0xFD, valid command, but bad (command) prerequisites;
> + * 0xFC, the command was rejected as the SMU is busy;
> + * 0xFB, "SMC_Result_DebugDataDumpEnd".
> + *
> + * The values here are not defined by macros, because I'd rather we
> + * include a single header file which defines them, which is
> + * maintained by the SMU FW team, so that we're impervious to firmware
> + * changes. At the moment those values are defined in various header
> + * files, one for each ASIC, yet here we're a single ASIC-agnostic
> + * interface. Such a change can be followed-up by a subsequent patch.
> + */
> +static u32 __smu_cmn_poll_stat(struct smu_context *smu)
>  {
>         struct amdgpu_device *adev = smu->adev;
> -       uint32_t cur_value, i, timeout = adev->usec_timeout * 20;
> +       int timeout = adev->usec_timeout * 20;
> +       u32 reg;
>
> -       for (i = 0; i < timeout; i++) {
> -               cur_value = RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90);
> -               if ((cur_value & MP1_C2PMSG_90__CONTENT_MASK) != 0)
> -                       return cur_value;
> +       for ( ; timeout > 0; timeout--) {
> +               reg = RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90);
> +               if ((reg & MP1_C2PMSG_90__CONTENT_MASK) != 0)
> +                       break;
>
>                 udelay(1);
>         }
>
> -       /* timeout means wrong logic */
> -       if (i == timeout)
> -               return -ETIME;
> -
> -       return RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90);
> +       return reg;
>  }
>
> -int smu_cmn_send_msg_without_waiting(struct smu_context *smu,
> -                                    uint16_t msg, uint32_t param)
> +static void __smu_cmn_reg_print_error(struct smu_context *smu,
> +                                     u32 reg_c2pmsg_90,
> +                                     int msg_index,
> +                                     u32 param,
> +                                     enum smu_message_type msg)
>  {
>         struct amdgpu_device *adev = smu->adev;
> -       int ret;
> +       const char *message = smu_get_message_name(smu, msg);
>
> -       ret = smu_cmn_wait_for_response(smu);
> -       if (ret != 0x1) {
> -               dev_err(adev->dev, "Msg issuing pre-check failed(0x%x) and "
> -                      "SMU may be not in the right state!\n", ret);
> -               if (ret != -ETIME)
> -                       ret = -EIO;
> -               return ret;
> +       switch (reg_c2pmsg_90) {
> +       case SMU_BUSY:
> +               dev_err_ratelimited(adev->dev,
> +                                   "SMU: I'm not done with your previous command!");
> +               break;
> +       case SMU_GOOD:
> +               /* The SMU executed the command. It completed with a
> +                * successful result.
> +                */
> +               break;
> +       case SMU_CMD_FAIL:
> +               /* The SMU executed the command. It completed with an
> +                * unsuccessful result.
> +                */
> +               break;
> +       case SMU_CMD_UNKNOWN:
> +               dev_err_ratelimited(adev->dev,
> +                                   "SMU: unknown command: index:%d param:0x%08X message:%s",
> +                                   msg_index, param, message);
> +               break;
> +       case SMU_CMD_BAD_PREREQ:
> +               dev_err_ratelimited(adev->dev,
> +                                   "SMU: valid command, bad prerequisites: index:%d param:0x%08X message:%s",
> +                                   msg_index, param, message);
> +               break;
> +       case SMU_BUSY_OTHER:
> +               dev_err_ratelimited(adev->dev,
> +                                   "SMU: I'm very busy for your command: index:%d param:0x%08X message:%s",
> +                                   msg_index, param, message);
> +               break;
> +       case SMU_DEBUG_END:
> +               dev_err_ratelimited(adev->dev,
> +                                   "SMU: I'm debugging!");
> +               break;
> +       default:
> +               dev_err_ratelimited(adev->dev,
> +                                   "SMU: response:0x%08X for index:%d param:0x%08X message:%s?",
> +                                   reg_c2pmsg_90, msg_index, param, message);
> +               break;
> +       }
> +}
> +
> +static int __smu_cmn_reg2errno(struct smu_context *smu, u32 reg_c2pmsg_90)
> +{
> +       int res;
> +
> +       switch (reg_c2pmsg_90) {
> +       case SMU_BUSY:
> +               /* The SMU is busy--still executing your command.
> +                */
> +               res = -ETIME;
> +               break;
> +       case SMU_GOOD:
> +               res = 0;
> +               break;
> +       case SMU_CMD_FAIL:
> +               /* Command completed successfully, but the command
> +                * status was failure.
> +                */
> +               res = -EIO;
> +               break;
> +       case SMU_CMD_UNKNOWN:
> +               /* Unknown command--ignored by the SMU.
> +                */
> +               res = -EOPNOTSUPP;
> +               break;
> +       case SMU_CMD_BAD_PREREQ:
> +               /* Valid command--bad prerequisites.
> +                */
> +               res = -EIO;

Maybe -EINVAL would be better here to differentiate it from other cases.

> +               break;
> +       case SMU_BUSY_OTHER:
> +               /* The SMU is busy with other commands. The client
> +                * should retry in 10 us.
> +                */
> +               res = -EBUSY;
> +               break;
> +       case SMU_DEBUG_END:
> +               /* Debug message.
> +                */
> +               res = -EIO;

Maybe -EREMOTEIO here as well since this seems more logically grouped
with the default.

> +               break;
> +       default:
> +               /* Unknown result from the SMU.
> +                */
> +               res = -EREMOTEIO;
> +               break;
>         }
>
> +       return res;
> +}
> +
> +static void __smu_cmn_send_msg(struct smu_context *smu,
> +                              u16 msg,
> +                              u32 param)
> +{
> +       struct amdgpu_device *adev = smu->adev;
> +
>         WREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90, 0);
>         WREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_82, param);
>         WREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_66, msg);
> +}
>
> -       return 0;
> +/**
> + * smu_cmn_send_msg_without_waiting -- send the message; don't wait for status
> + * @smu: pointer to an SMU context
> + * @msg_index: message index
> + * @param: message parameter to send to the SMU
> + *
> + * Send a message to the SMU with the parameter passed. Do not wait
> + * for status/result of the message, thus the "without_waiting".
> + *
> + * Return 0 on success, -errno on error if we weren't able to _send_
> + * the message for some reason. See __smu_cmn_reg2errno() for details
> + * of the -errno.
> + */
> +int smu_cmn_send_msg_without_waiting(struct smu_context *smu,
> +                                    uint16_t msg_index,
> +                                    uint32_t param)
> +{
> +       u32 reg;
> +       int res;
> +
> +       if (smu->adev->in_pci_err_recovery)
> +               return 0;
> +
> +       mutex_lock(&smu->message_lock);
> +       reg = __smu_cmn_poll_stat(smu);
> +       res = __smu_cmn_reg2errno(smu, reg);
> +       if (reg == SMU_BUSY ||
> +           reg == SMU_BUSY_OTHER ||
> +           res == -EREMOTEIO)
> +               goto Out;
> +       __smu_cmn_send_msg(smu, msg_index, param);
> +       res = 0;
> +Out:
> +       mutex_unlock(&smu->message_lock);
> +       return res;
>  }
>
> +/**
> + * smu_cmn_wait_for_response -- wait for response from the SMU
> + * @smu: pointer to an SMU context
> + *
> + * Wait for status from the SMU.
> + *
> + * Return 0 on success, -errno on error, indicating the execution
> + * status and result of the message being waited for. See
> + * __smu_cmn_reg2errno() for details of the -errno.
> + */
> +int smu_cmn_wait_for_response(struct smu_context *smu)
> +{
> +       u32 reg;
> +
> +       reg = __smu_cmn_poll_stat(smu);
> +       return __smu_cmn_reg2errno(smu, reg);
> +}
> +
> +/**
> + * smu_cmn_send_smc_msg_with_param -- send a message with parameter
> + * @smu: pointer to an SMU context
> + * @msg: message to send
> + * @param: parameter to send to the SMU
> + * @read_arg: pointer to u32 to return a value from the SMU back
> + *            to the caller
> + *
> + * Send the message @msg with parameter @param to the SMU, wait for
> + * completion of the command, and return back a value from the SMU in
> + * @read_arg pointer.
> + *
> + * Return 0 on success, -errno on error, if we weren't able to send
> + * the message or if the message completed with some kind of
> + * error. See __smu_cmn_reg2errno() for details of the -errno.
> + *
> + * If we weren't able to send the message to the SMU, we also print
> + * the error to the standard log.
> + *
> + * Command completion status is printed only if the -errno is
> + * -EREMOTEIO, indicating that the SMU returned back an
> + * undefined/unknown/unspecified result. All other cases are
> + * well-defined, not printed, but instead given back to the client to
> + * decide what further to do.
> + *
> + * The return value, @read_arg is read back regardless, to give back
> + * more information to the client, which on error would most likely be
> + * @param, but we can't assume that. This also eliminates more
> + * conditionals.
> + */
>  int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
>                                     enum smu_message_type msg,
>                                     uint32_t param,
>                                     uint32_t *read_arg)
>  {
> -       struct amdgpu_device *adev = smu->adev;
> -       int ret = 0, index = 0;
> +       int res, index;
> +       u32 reg;
>
>         if (smu->adev->in_pci_err_recovery)
>                 return 0;
> @@ -136,31 +345,24 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
>                 return index == -EACCES ? 0 : index;
>
>         mutex_lock(&smu->message_lock);
> -       ret = smu_cmn_send_msg_without_waiting(smu, (uint16_t)index, param);
> -       if (ret)
> -               goto out;
> -
> -       ret = smu_cmn_wait_for_response(smu);
> -       if (ret != 0x1) {
> -               if (ret == -ETIME) {
> -                       dev_err(adev->dev, "message: %15s (%d) \tparam: 0x%08x is timeout (no response)\n",
> -                               smu_get_message_name(smu, msg), index, param);
> -               } else {
> -                       dev_err(adev->dev, "failed send message: %15s (%d) \tparam: 0x%08x response %#x\n",
> -                               smu_get_message_name(smu, msg), index, param,
> -                               ret);
> -                       ret = -EIO;
> -               }
> -               goto out;
> +       reg = __smu_cmn_poll_stat(smu);
> +       res = __smu_cmn_reg2errno(smu, reg);
> +       if (reg == SMU_BUSY ||
> +           reg == SMU_BUSY_OTHER ||
> +           res == -EREMOTEIO) {
> +               __smu_cmn_reg_print_error(smu, reg, index, param, msg);
> +               goto Out;
>         }
> -
> +       __smu_cmn_send_msg(smu, (uint16_t) index, param);
> +       reg = __smu_cmn_poll_stat(smu);
> +       res = __smu_cmn_reg2errno(smu, reg);
> +       if (res == -EREMOTEIO)
> +               __smu_cmn_reg_print_error(smu, reg, index, param, msg);
>         if (read_arg)
>                 smu_cmn_read_arg(smu, read_arg);
> -
> -       ret = 0; /* 0 as driver return value */
> -out:
> +Out:
>         mutex_unlock(&smu->message_lock);
> -       return ret;
> +       return res;
>  }
>
>  int smu_cmn_send_smc_msg(struct smu_context *smu,
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h
> index 9add5f16ff562a..16993daa2ae042 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h
> @@ -27,7 +27,8 @@
>
>  #if defined(SWSMU_CODE_LAYER_L2) || defined(SWSMU_CODE_LAYER_L3) || defined(SWSMU_CODE_LAYER_L4)
>  int smu_cmn_send_msg_without_waiting(struct smu_context *smu,
> -                                    uint16_t msg, uint32_t param);
> +                                    uint16_t msg_index,
> +                                    uint32_t param);
>  int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
>                                     enum smu_message_type msg,
>                                     uint32_t param,
> --
> 2.32.0
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx