Thanks for sharing the previous context on this.

After a further check of the code, I found something interesting about the early-return check in si_dpm_late_init and how it behaves when dpm_enabled is already true.

The sequence to enable the temperature range at boot is:

1. In the IP hw_init (si_dpm_hw_init), which runs ahead of late_init, the temperature range is set as part of si_thermal_start_thermal_controller.
2. adev->pm.dpm_enabled is set to true unconditionally in si_dpm_hw_init.
3. In si_dpm_late_init, the temperature range setting is still executed, because the check there is "if (!adev->pm.dpm_enabled) return 0".

It looks like we should skip it when dpm, including the temperature range, has already been set up.

So I guess the random failure in enabling/disabling the thermal alert is possibly because the amdgpu driver does not check the return value when setting the temperature range in the hw_init phase. The FW may randomly not have finished processing that request when the driver immediately issues the same setting again, so the FW complains and returns an error code to the driver. This may explain why a delay helps in this case. Or am I misunderstanding this because of my limited knowledge?

Hi Zhenneng,

Additionally, can you please try modifying the check in si_dpm_late_init so that it returns early when adev->pm.dpm_enabled is true?
[I have also dropped some public mailing lists, as this issue looks specific to the amdgpu driver] :)

> -----Original Message-----
> From: 李真能 <lizhenneng@xxxxxxxxxx>
> Sent: Monday, March 13, 2023 9:05 AM
> To: Chen, Guchun <Guchun.Chen@xxxxxxx>; Deucher, Alexander
> <Alexander.Deucher@xxxxxxx>
> Cc: David Airlie <airlied@xxxxxxxx>; Pan, Xinhui <Xinhui.Pan@xxxxxxx>;
> linux-kernel@xxxxxxxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx;
> amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Daniel Vetter <daniel@xxxxxxxx>; Koenig, Christian
> <Christian.Koenig@xxxxxxx>
> Subject: Re: [PATCH] drm/amdgpu: resove reboot exception for si oland
>
> This bug was first reported here:
>
> https://lore.kernel.org/lkml/1a620e7c-5b71-3d16-001a-0d79b292aca7@xxxxxxx/
>
> I modified the patch according to the mailing list discussion, and I ran the
> reboot test tens of thousands of times on about 10 arm64 machines; no bug was
> reported.
>
> On 2023/3/10 16:18, Chen, Guchun wrote:
> >> -----Original Message-----
> >> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of
> >> Zhenneng Li
> >> Sent: Friday, March 10, 2023 3:40 PM
> >> To: Deucher, Alexander <Alexander.Deucher@xxxxxxx>
> >> Cc: David Airlie <airlied@xxxxxxxx>; Pan, Xinhui
> >> <Xinhui.Pan@xxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
> >> dri-devel@xxxxxxxxxxxxxxxxxxxxx; Zhenneng Li <lizhenneng@xxxxxxxxxx>;
> >> amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Daniel Vetter <daniel@xxxxxxxx>;
> >> Koenig, Christian <Christian.Koenig@xxxxxxx>
> >> Subject: [PATCH] drm/amdgpu: resove reboot exception for si oland
> >>
> >> During reboot tests on an arm64 platform, boot may fail.
> >>
> >> The error messages are as follows:
> >> [ 6.996395][ 7] [ T295] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block <si_dpm> failed -22
> >> [ 7.006919][ 7] [ T295] amdgpu 0000:04:00.0: amdgpu_device_ip_late_init failed
> >> [ 7.014224][ 7] [ T295] amdgpu 0000:04:00.0: Fatal error during GPU init
> >> ---
> >>  drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c | 3 ---
> >>  1 file changed, 3 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> >> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> >> index d6d9e3b1b2c0..dee51c757ac0 100644
> >> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> >> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> >> @@ -7632,9 +7632,6 @@ static int si_dpm_late_init(void *handle)
> >>  	if (!adev->pm.dpm_enabled)
> >>  		return 0;
> >>
> >> -	ret = si_set_temperature_range(adev);
> >> -	if (ret)
> >> -		return ret;
> > si_set_temperature_range should be platform agnostic. Can you please
> > elaborate more?
> >
> > Regards,
> > Guchun
> >
> >>  #if 0 //TODO ?
> >>  	si_dpm_powergate_uvd(adev, true);
> >>  #endif
> >> --
> >> 2.25.1