[Public]

One problem is that this also bypasses a valid 0 response, which usually means FW may not have completed processing the previous message. My thinking was that the driver shouldn't even attempt to send a message once it has identified a FW hang.

Is there a possibility of hitting the same problem whenever there is SRIOV full access, e.g. before/after reset?

If state == FW_INIT, ignore the response state before sending the message.

If the expected response to a message does not arrive, set the state to FW_HANG. This part is tricky, as what qualifies as a FW hang could change based on the specific SOC's message. I'm avoiding a bool for this reason, to keep it open for other FW states.

If state == FW_HANG, don't even attempt to send the message.

Move the FW state to FW_INIT whenever there is an init/resume sequence - hw_init/hw_resume?

Thanks,
Lijo

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Danijel Slivka
Sent: Monday, July 8, 2024 1:37 PM
To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Slivka, Danijel <Danijel.Slivka@xxxxxxx>
Subject: [PATCH] drm/amd/pm: Ignore initial value in smu response register

Why:
If the reg mmMP1_SMN_C2PMSG_90 is written to during amdgpu driver load or driver unload, the subsequent amdgpu driver load will fail at smu_hw_init. The default value of the mmMP1_SMN_C2PMSG_90 register in a clean environment is 0x1, and if the value differs from the expected one, amdgpu driver load will fail.

How to fix:
Ignore the initial value in the smu response register before the first smu message is sent, and proceed to send the message. If the register holds 0x0 or an unexpected value after an smu message was sent, set the fw_state_hang flag so that no further smu messages are sent.
Signed-off-by: Danijel Slivka <danijel.slivka@xxxxxxx>
---
 drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h | 1 +
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c        | 7 +++++--
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
index a34c802f52be..bfe08fa0db6d 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
@@ -562,6 +562,7 @@ struct smu_context {
 	uint32_t smc_fw_if_version;
 	uint32_t smc_fw_version;
 	uint32_t smc_fw_caps;
+	bool smc_fw_state_hang;
 
 	bool uploading_custom_pp_table;
 	bool dc_controlled_by_gpio;
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
index 5592fd825aa3..9e4e62dcbee7 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
@@ -421,7 +421,7 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
 	if (poll) {
 		reg = __smu_cmn_poll_stat(smu);
 		res = __smu_cmn_reg2errno(smu, reg);
-		if (reg == SMU_RESP_NONE || res == -EREMOTEIO) {
+		if ((reg == SMU_RESP_NONE || res == -EREMOTEIO) && smu->smc_fw_state_hang) {
 			__smu_cmn_reg_print_error(smu, reg, index, param, msg);
 			goto Out;
 		}
@@ -429,8 +429,11 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
 	__smu_cmn_send_msg(smu, (uint16_t) index, param);
 	reg = __smu_cmn_poll_stat(smu);
 	res = __smu_cmn_reg2errno(smu, reg);
-	if (res != 0)
+	if (res != 0) {
+		if (reg == SMU_RESP_NONE || res == -EREMOTEIO)
+			smu->smc_fw_state_hang = true;
 		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
+	}
 	if (read_arg) {
 		smu_cmn_read_arg(smu, read_arg);
 		dev_dbg(adev->dev, "smu send message: %s(%d) param: 0x%08x, resp: 0x%08x,\
-- 
2.34.1
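
For illustration only, the state-based flow suggested in the review comments at the top of this mail could look roughly like the sketch below. It is a standalone simulation, not driver code: struct fake_smu, fake_hw_init(), fake_send_msg(), the FW_RUNTIME state and the numeric response values are all hypothetical, and the errno values are portable stand-ins rather than what __smu_cmn_reg2errno() returns.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Response names follow the patch; the numeric values are assumptions. */
#define SMU_RESP_NONE 0x0
#define SMU_RESP_OK   0x1

/* Hypothetical FW states; FW_RUNTIME is an invented name for "normal". */
enum fw_state {
	FW_INIT,
	FW_RUNTIME,
	FW_HANG,
};

/* Hypothetical stand-in for struct smu_context and the response register. */
struct fake_smu {
	enum fw_state fw_state;
	uint32_t resp_reg;   /* simulated response register contents */
	uint32_t next_resp;  /* what the simulated FW answers to the next message */
	unsigned int sent;   /* number of messages actually sent */
};

/* Init/resume sequence: move the FW state back to FW_INIT. */
static void fake_hw_init(struct fake_smu *smu)
{
	assert(smu);
	smu->fw_state = FW_INIT;
}

static int fake_send_msg(struct fake_smu *smu)
{
	assert(smu);

	/* Once a hang has been identified, do not attempt to send at all. */
	if (smu->fw_state == FW_HANG)
		return -EIO;

	/*
	 * Outside FW_INIT a 0 response is still honoured: FW may not have
	 * finished the previous message. In FW_INIT the stale register
	 * value is ignored.
	 */
	if (smu->fw_state != FW_INIT && smu->resp_reg == SMU_RESP_NONE)
		return -ETIMEDOUT;

	/* "Send" the message and "poll" the response register. */
	smu->resp_reg = smu->next_resp;
	smu->sent++;

	/* No response after sending: treat it as a FW hang in this sketch. */
	if (smu->resp_reg == SMU_RESP_NONE) {
		smu->fw_state = FW_HANG;
		return -ETIMEDOUT;
	}

	smu->fw_state = FW_RUNTIME;
	return 0;
}
```

What actually qualifies as a hang is left as a single check here; as noted in the review, that may need to differ per SOC message, which is one reason an enum leaves more room than a bool.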