On Sun, Oct 22, 2023 at 9:05 PM Feng, Kenneth <Kenneth.Feng@xxxxxxx> wrote:
>
> [AMD Official Use Only - General]
>
> Thanks Alex, I will make another patch.
> And please refer to the comments inline below.
>
>
> -----Original Message-----
> From: Alex Deucher <alexdeucher@xxxxxxxxx>
> Sent: Friday, October 20, 2023 9:58 PM
> To: Feng, Kenneth <Kenneth.Feng@xxxxxxx>
> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>
> Subject: Re: [PATCH] drm/amd/pm: fix the high voltage and temperature issue on smu 13
>
> Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.
>
>
> On Fri, Oct 20, 2023 at 4:32 AM Kenneth Feng <kenneth.feng@xxxxxxx> wrote:
> >
> > fix the high voltage and temperature issue after the driver is
> > unloaded on smu 13.0.0, smu 13.0.7 and smu 13.0.10
> >
> > Signed-off-by: Kenneth Feng <kenneth.feng@xxxxxxx>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 36 +++++++++++++++----
> >  drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c        |  4 +--
> >  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c     | 27 ++++++++++++--
> >  drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h |  1 +
> >  drivers/gpu/drm/amd/pm/swsmu/inc/smu_v13_0.h  |  2 ++
> >  .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c    | 13 +++++++
> >  .../drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c  |  8 ++++-
> >  .../drm/amd/pm/swsmu/smu13/smu_v13_0_7_ppt.c  |  8 ++++-
> >  8 files changed, 86 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 31f8c3ead161..c5c892a8b3f9 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -3986,13 +3986,23 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> >  				}
> >  			}
> >  		} else {
> > -			tmp = amdgpu_reset_method;
> > -			/* It should do a default reset when loading or reloading the driver,
> > -			 * regardless of the module parameter reset_method.
> > -			 */
> > -			amdgpu_reset_method = AMD_RESET_METHOD_NONE;
> > -			r = amdgpu_asic_reset(adev);
> > -			amdgpu_reset_method = tmp;
> > +			switch (amdgpu_ip_version(adev, MP1_HWIP, 0)) {
> > +			case IP_VERSION(13, 0, 0):
> > +			case IP_VERSION(13, 0, 7):
> > +			case IP_VERSION(13, 0, 10):
> > +				r = psp_gpu_reset(adev);
> > +				break;
> > +			default:
> > +				tmp = amdgpu_reset_method;
> > +				/* It should do a default reset when loading or reloading the driver,
> > +				 * regardless of the module parameter reset_method.
> > +				 */
> > +				amdgpu_reset_method = AMD_RESET_METHOD_NONE;
> > +				r = amdgpu_asic_reset(adev);
> > +				amdgpu_reset_method = tmp;
> > +				break;
> > +			}
> > +
> >  		if (r) {
> >  			dev_err(adev->dev, "asic reset on init failed\n");
> >  			goto failed;
> > @@ -5945,6 +5955,18 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
> >  		return -ENOTSUPP;
> >
> >  	ret = amdgpu_dpm_baco_exit(adev);
> > +
> > +	if (!ret)
> > +		switch (amdgpu_ip_version(adev, MP1_HWIP, 0)) {
> > +		case IP_VERSION(13, 0, 0):
> > +		case IP_VERSION(13, 0, 7):
> > +		case IP_VERSION(13, 0, 10):
> > +			adev->gfx.is_poweron = false;
> > +			break;
> > +		default:
> > +			break;
> > +		}
>
> Maybe better to move this into smu_v13_0_0_baco_exit() so we keep the asic specific details out of the common files?
>
> > +
> >  	if (ret)
> >  		return ret;
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> > b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> > index 80ca2c05b0b8..3ad38e42773b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> > @@ -73,7 +73,7 @@ gmc_v11_0_vm_fault_interrupt_state(struct amdgpu_device *adev,
> >  		 * fini/suspend, so the overall state doesn't
> >  		 * change over the course of suspend/resume.
> >  		 */
> > -		if (!adev->in_s0ix)
> > +		if (!adev->in_s0ix && adev->gfx.is_poweron)
> >  			amdgpu_gmc_set_vm_fault_masks(adev, AMDGPU_GFXHUB(0), false);
> >  		break;
> >  	case AMDGPU_IRQ_STATE_ENABLE:
> > @@ -85,7 +85,7 @@ gmc_v11_0_vm_fault_interrupt_state(struct amdgpu_device *adev,
> >  		 * fini/suspend, so the overall state doesn't
> >  		 * change over the course of suspend/resume.
> >  		 */
> > -		if (!adev->in_s0ix)
> > +		if (!adev->in_s0ix && adev->gfx.is_poweron)
> >  			amdgpu_gmc_set_vm_fault_masks(adev, AMDGPU_GFXHUB(0), true);
> >  		break;
> >  	default:
>
> These changes are probably a valid bug fix on their own.
> [Kenneth] - When driver is unloaded, gfx core is powered off first. Then in gmc_hw_fini, the gfxhub interruption operation needs to be skipped. Do we need a separate patch for this?

Would this trigger in any other cases? E.g., suspend/resume? If so, I think it makes sense as a standalone bug fix. If not, it's fine to include it in this patch.

>
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> > b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> > index 7c3356d6da5e..30e5f7161737 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> > @@ -733,7 +733,7 @@ static int smu_early_init(void *handle)
> >  	smu->adev = adev;
> >  	smu->pm_enabled = !!amdgpu_dpm;
> >  	smu->is_apu = false;
> > -	smu->smu_baco.state = SMU_BACO_STATE_EXIT;
> > +	smu->smu_baco.state = SMU_BACO_STATE_NONE;
>
> I'm not sure I understand this change. Is this just to set the default BACO state? Maybe this would be better as a separate patch.
> [Kenneth] - smu->smu_baco.state is needed when driver is unloaded, if it's baco exited, then need to reset MP1_FIRMWARE_FLAG = 0, otherwise MP1_FIRMWARE_FLAG doesn't need to be reset.
> Currently by default smu->smu_baco.state is baco exited status, we can't recognize if it's really a hardware baco exited status. Do you think we still need a separate patch for it?

No, this makes sense. I just want to verify the reason for the change. It would be nice to include this detail in the commit message.

>
> >  	smu->smu_baco.platform_support = false;
> >  	smu->user_dpm_profile.fan_mode = -1;
> >
> > @@ -1740,10 +1740,25 @@ static int smu_smc_hw_cleanup(struct smu_context *smu)
> >  	return 0;
> >  }
> >
> > +static int smu_reset_mp1_state(struct smu_context *smu)
> > +{
> > +	struct amdgpu_device *adev = smu->adev;
> > +
> > +	switch (amdgpu_ip_version(adev, MP1_HWIP, 0)) {
> > +	case IP_VERSION(13, 0, 0):
> > +	case IP_VERSION(13, 0, 7):
> > +	case IP_VERSION(13, 0, 10):
> > +		return smu_set_mp1_state(smu, PP_MP1_STATE_UNLOAD);
> > +	default:
> > +		return 0;
> > +	}
> > +}
> > +
> >  static int smu_hw_fini(void *handle)
> >  {
> >  	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
> >  	struct smu_context *smu = adev->powerplay.pp_handle;
> > +	int ret;
> >
> >  	if (amdgpu_sriov_vf(adev) && !amdgpu_sriov_is_pp_one_vf(adev))
> >  		return 0;
> > @@ -1761,7 +1776,15 @@ static int smu_hw_fini(void *handle)
> >
> >  	adev->pm.dpm_enabled = false;
> >
> > -	return smu_smc_hw_cleanup(smu);
> > +	ret = smu_smc_hw_cleanup(smu);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = smu_reset_mp1_state(smu);
> > +	if (ret)
> > +		return ret;
> > +
> > +	return 0;
> >  }
> >
> >  static void smu_late_fini(void *handle)
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
> > b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
> > index 1454eed76604..9f2dbc90b606 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
> > @@ -419,6 +419,7 @@ enum smu_reset_mode {
> >  enum smu_baco_state {
> >  	SMU_BACO_STATE_ENTER = 0,
> >  	SMU_BACO_STATE_EXIT,
> > +	SMU_BACO_STATE_NONE,
> >  };
> >
> >  struct smu_baco_context {
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/inc/smu_v13_0.h
> > b/drivers/gpu/drm/amd/pm/swsmu/inc/smu_v13_0.h
> > index cc02f979e9e9..43c7ba68eb50 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/inc/smu_v13_0.h
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/inc/smu_v13_0.h
> > @@ -299,5 +299,7 @@ int smu_v13_0_update_pcie_parameters(struct smu_context *smu,
> >  				     uint8_t pcie_gen_cap,
> >  				     uint8_t pcie_width_cap);
> >
> > +int smu_v13_0_disable_pmfw_state(struct smu_context* smu);
> > +
> >  #endif
> >  #endif
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
> > b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
> > index bcb7ab9d2221..0724441e53ef 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
> > @@ -2473,3 +2473,16 @@ int smu_v13_0_update_pcie_parameters(struct
> > smu_context *smu,
> >
> >  	return 0;
> >  }
> > +
> > +int smu_v13_0_disable_pmfw_state(struct smu_context* smu)
> > +{
> > +	int ret;
> > +	struct amdgpu_device *adev = smu->adev;
> > +
> > +	WREG32_PCIE(MP1_Public | (smnMP1_FIRMWARE_FLAGS & 0xffffffff),
> > +		    0);
> > +
> > +	ret = RREG32_PCIE(MP1_Public |
> > +			  (smnMP1_FIRMWARE_FLAGS &
> > +			   0xffffffff));
> > +
> > +	return ret == 0 ? 0 : -EINVAL;
> > +}
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c
> > b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c
> > index 47d008cbc186..0a167f70f4bc 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c
> > @@ -2758,7 +2758,13 @@ static int smu_v13_0_0_set_mp1_state(struct
> > smu_context *smu,
> >
> >  	switch (mp1_state) {
> >  	case PP_MP1_STATE_UNLOAD:
> > -		ret = smu_cmn_set_mp1_state(smu, mp1_state);
> > +		ret = smu_cmn_send_smc_msg_with_param(smu,
> > +						      SMU_MSG_PrepareMp1ForUnload,
> > +						      0x55,
> > +						      NULL);
> > +
> > +		if(!ret && smu->smu_baco.state == SMU_BACO_STATE_EXIT)
>
> space between if and (
>
> > +			ret = smu_v13_0_disable_pmfw_state(smu);
> > +
> >  		break;
> >  	default:
> >  		/* Ignore others */
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_7_ppt.c
> > b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_7_ppt.c
> > index b8a7a1d853df..d7a4a03b1e31 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_7_ppt.c
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_7_ppt.c
> > @@ -2429,7 +2429,13 @@ static int smu_v13_0_7_set_mp1_state(struct
> > smu_context *smu,
> >
> >  	switch (mp1_state) {
> >  	case PP_MP1_STATE_UNLOAD:
> > -		ret = smu_cmn_set_mp1_state(smu, mp1_state);
> > +		ret = smu_cmn_send_smc_msg_with_param(smu,
> > +						      SMU_MSG_PrepareMp1ForUnload,
> > +						      0x55,
> > +						      NULL);
> > +
> > +		if(!ret && smu->smu_baco.state == SMU_BACO_STATE_EXIT)
>
> Same here.
>
> Alex
>
> > +			ret = smu_v13_0_disable_pmfw_state(smu);
> > +
> >  		break;
> >  	default:
> >  		/* Ignore others */
> > --
> > 2.34.1
> >
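
For readers following the SMU_BACO_STATE_NONE discussion above, here is a minimal standalone sketch of the reasoning Kenneth gives: with EXIT as the default, an unload cannot tell "never entered BACO" apart from "really exited BACO", so the firmware flags would always be cleared. This is a user-space model only, not driver code. The enum values mirror the patch; struct model_smu, model_early_init() and model_unload() are hypothetical names invented for illustration, and the register write and SMU message are reduced to plain field updates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the enum from the patch, including the new NONE value. */
enum smu_baco_state {
	SMU_BACO_STATE_ENTER = 0,
	SMU_BACO_STATE_EXIT,
	SMU_BACO_STATE_NONE,
};

/* Hypothetical stand-in for the driver context; not the real struct smu_context. */
struct model_smu {
	enum smu_baco_state baco_state;
	unsigned int mp1_firmware_flags;	/* models MP1_FIRMWARE_FLAGS */
	bool unload_msg_sent;			/* models PrepareMp1ForUnload */
};

/* Models smu_early_init(): the default is NONE, so EXIT only ever means a real BACO exit. */
void model_early_init(struct model_smu *smu, unsigned int flags)
{
	assert(smu != NULL);
	smu->baco_state = SMU_BACO_STATE_NONE;
	smu->mp1_firmware_flags = flags;
	smu->unload_msg_sent = false;
}

/* Models the PP_MP1_STATE_UNLOAD path: clear the flags only after a real BACO exit. */
int model_unload(struct model_smu *smu)
{
	assert(smu != NULL);
	smu->unload_msg_sent = true;
	if (smu->baco_state == SMU_BACO_STATE_EXIT)
		smu->mp1_firmware_flags = 0;
	return 0;
}
```

In this model, a context that never went through BACO keeps its flags on unload, while one whose state was set to SMU_BACO_STATE_EXIT has them cleared. That is the distinction the old EXIT default could not express.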