RE: [PATCH] drm/amdgpu: resove reboot exception for si oland

Thanks for sharing previous context on this. 

After a further look at the code, I found something interesting in si_dpm_late_init: the early return there is taken only when dpm_enabled is false, so the rest of the function still runs once dpm has been enabled.
The sequence that sets the temperature range during boot is:
1. In the IP hw_init (si_dpm_hw_init), which runs ahead of late_init, the temperature range is set as part of si_thermal_start_thermal_controller.
2. adev->pm.dpm_enabled is set to true unconditionally in si_dpm_hw_init.
3. In si_dpm_late_init, the temperature range is set again, because the check there is "if (!adev->pm.dpm_enabled) return 0". It looks like we should skip this step when dpm, including the temperature range, has already been set up.

So my guess is that the random failure in enabling/disabling the thermal alert happens because the amdgpu driver does not check the return value when it sets the temperature range in the hw_init phase. The FW may occasionally not have finished that request yet when the driver immediately issues the same setting again, so the FW complains and returns an error code to the driver. This would explain why adding a delay works in this case. Or am I misunderstanding something here?
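
To make the guess above concrete, here is a small standalone toy model in C. It is only a sketch of the hypothesis, not driver code: struct mock_fw, the 3-tick processing time, and the rule that the FW rejects a request while another is in flight are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy firmware model (hypothetical): a request stays in flight for a few
 * ticks, and a new request arriving during that time is rejected. This
 * timing behaviour is assumed for illustration only. */
struct mock_fw {
	int busy_ticks;   /* ticks left until the in-flight request completes */
};

/* Let `ticks` units of time pass (stands in for a delay in the driver). */
static void mock_fw_wait(struct mock_fw *fw, int ticks)
{
	assert(ticks >= 0);
	fw->busy_ticks = (fw->busy_ticks > ticks) ? fw->busy_ticks - ticks : 0;
}

/* Stand-in for si_set_temperature_range(): -EINVAL while the FW is busy. */
static int mock_set_temperature_range(struct mock_fw *fw)
{
	if (fw->busy_ticks > 0)
		return -EINVAL;
	fw->busy_ticks = 3;   /* hypothetical processing time */
	return 0;
}

/* Boot sequence: hw_init sets the range and drops the result, then
 * late_init sets it again after `delay` ticks. Returns what late_init
 * would return in this model. */
int mock_boot(int delay)
{
	struct mock_fw fw = { .busy_ticks = 0 };
	bool dpm_enabled;

	(void)mock_set_temperature_range(&fw);  /* hw_init: result ignored */
	dpm_enabled = true;                      /* set unconditionally */

	mock_fw_wait(&fw, delay);                /* time between the phases */

	if (!dpm_enabled)                        /* check as in late_init */
		return 0;
	return mock_set_temperature_range(&fw);  /* second, repeated request */
}
```

In this model mock_boot(0) returns -EINVAL because the second request arrives while the first is still in flight, and mock_boot(3) returns 0 because the delay lets the first request finish. Whether the real FW behaves this way is exactly the open question.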

Hi Zhenneng,

Additionally, can you please try to modify the check to return early in si_dpm_late_init when adev->pm.dpm_enabled is true?
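
A rough standalone sketch of what I mean, again as a toy model rather than a patch: struct demo_adev and all demo_* functions are made-up stand-ins, and the real si_dpm_late_init does more than this.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few fields this sketch needs. */
struct demo_adev {
	bool dpm_enabled;
	int range_requests;   /* how many times the range was sent to FW */
};

/* Stand-in for si_set_temperature_range(): only counts the requests. */
static int demo_set_temperature_range(struct demo_adev *adev)
{
	adev->range_requests++;
	return 0;
}

/* hw_init: sets the range once, then marks dpm as enabled. */
static void demo_hw_init(struct demo_adev *adev)
{
	(void)demo_set_temperature_range(adev);
	adev->dpm_enabled = true;
}

/* late_init with the existing check: returns early only if dpm is off. */
static int demo_late_init_current(struct demo_adev *adev)
{
	if (!adev->dpm_enabled)
		return 0;
	return demo_set_temperature_range(adev);
}

/* late_init with the suggested check: returns early if dpm is already on. */
static int demo_late_init_suggested(struct demo_adev *adev)
{
	if (adev->dpm_enabled)
		return 0;
	return demo_set_temperature_range(adev);
}

/* Run hw_init followed by one late_init variant and report how many
 * temperature range requests were issued in total. */
int demo_count_requests(bool use_suggested_check)
{
	struct demo_adev adev = { .dpm_enabled = false, .range_requests = 0 };
	int ret;

	demo_hw_init(&adev);
	ret = use_suggested_check ? demo_late_init_suggested(&adev)
				  : demo_late_init_current(&adev);
	assert(ret == 0);
	return adev.range_requests;
}
```

With the existing check the range is requested twice (once per phase); with the suggested check it is requested only once, in hw_init.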

[I also dropped some public mailing lists, as this issue looks specific to the amdgpu driver] :)

> -----Original Message-----
> From: 李真能 <lizhenneng@xxxxxxxxxx>
> Sent: Monday, March 13, 2023 9:05 AM
> To: Chen, Guchun <Guchun.Chen@xxxxxxx>; Deucher, Alexander
> <Alexander.Deucher@xxxxxxx>
> Cc: David Airlie <airlied@xxxxxxxx>; Pan, Xinhui <Xinhui.Pan@xxxxxxx>;
> linux-kernel@xxxxxxxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx; amd-
> gfx@xxxxxxxxxxxxxxxxxxxxx; Daniel Vetter <daniel@xxxxxxxx>; Koenig, Christian
> <Christian.Koenig@xxxxxxx>
> Subject: Re: [PATCH] drm/amdgpu: resove reboot exception for si oland
> 
> This bug was first reported here:
> 
> https://lore.kernel.org/lkml/1a620e7c-5b71-3d16-001a-0d79b292aca7@xxxxxxx/
> 
> I modified the patch according to the mailing list discussion and ran the
> reboot test tens of thousands of times on about 10 arm64 machines; no bug
> was reported.
> 
> On 2023/3/10 16:18, Chen, Guchun wrote:
> >> -----Original Message-----
> >> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of
> >> Zhenneng Li
> >> Sent: Friday, March 10, 2023 3:40 PM
> >> To: Deucher, Alexander <Alexander.Deucher@xxxxxxx>
> >> Cc: David Airlie <airlied@xxxxxxxx>; Pan, Xinhui
> >> <Xinhui.Pan@xxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
> >> dri-devel@xxxxxxxxxxxxxxxxxxxxx; Zhenneng Li <lizhenneng@xxxxxxxxxx>;
> >> amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Daniel Vetter <daniel@xxxxxxxx>;
> >> Koenig, Christian <Christian.Koenig@xxxxxxx>
> >> Subject: [PATCH] drm/amdgpu: resove reboot exception for si oland
> >>
> >> During reboot testing on an arm64 platform, boot may fail.
> >>
> >> The error messages are as follows:
> >> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block <si_dpm> failed -22
> >> [    7.006919][ 7] [  T295] amdgpu 0000:04:00.0: amdgpu_device_ip_late_init failed
> >> [    7.014224][ 7] [  T295] amdgpu 0000:04:00.0: Fatal error during GPU init
> >> ---
> >>   drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c | 3 ---
> >>   1 file changed, 3 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> >> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> >> index d6d9e3b1b2c0..dee51c757ac0 100644
> >> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> >> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> >> @@ -7632,9 +7632,6 @@ static int si_dpm_late_init(void *handle)
> >>   	if (!adev->pm.dpm_enabled)
> >>   		return 0;
> >>
> >> -	ret = si_set_temperature_range(adev);
> >> -	if (ret)
> >> -		return ret;
> > si_set_temperature_range should be platform agnostic. Can you please
> > elaborate more?
> >
> > Regards,
> > Guchun
> >
> >>   #if 0 //TODO ?
> >>   	si_dpm_powergate_uvd(adev, true);
> >>   #endif
> >> --
> >> 2.25.1



