[AMD Official Use Only - General]

> -----Original Message-----
> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
> Sent: Thursday, November 24, 2022 6:49 PM
> To: Quan, Evan <Evan.Quan@xxxxxxx>; 李真能 <lizhenneng@xxxxxxxxxx>;
> Michel Dänzer <michel.daenzer@xxxxxxxxxxx>; Koenig, Christian
> <Christian.Koenig@xxxxxxx>; Deucher, Alexander
> <Alexander.Deucher@xxxxxxx>
> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Pan, Xinhui <Xinhui.Pan@xxxxxxx>;
> linux-kernel@xxxxxxxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [PATCH] drm/amdgpu: add mb for si
>
>
>
> On 11/24/2022 4:11 PM, Lazar, Lijo wrote:
> >
> >
> > On 11/24/2022 3:34 PM, Quan, Evan wrote:
> >> [AMD Official Use Only - General]
> >>
> >> Could the attached patch help?
> >>
> >> Evan
> >>> -----Original Message-----
> >>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of ???
> >>> Sent: Friday, November 18, 2022 5:25 PM
> >>> To: Michel Dänzer <michel.daenzer@xxxxxxxxxxx>; Koenig, Christian
> >>> <Christian.Koenig@xxxxxxx>; Deucher, Alexander
> >>> <Alexander.Deucher@xxxxxxx>
> >>> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Pan, Xinhui <Xinhui.Pan@xxxxxxx>;
> >>> linux-kernel@xxxxxxxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx
> >>> Subject: Re: [PATCH] drm/amdgpu: add mb for si
> >>>
> >>>
> >>> On 2022/11/18 17:18, Michel Dänzer wrote:
> >>>> On 11/18/22 09:01, Christian König wrote:
> >>>>> On 18.11.22 at 08:48, Zhenneng Li wrote:
> >>>>>> During reboot tests on an arm64 platform, boot may fail, so
> >>>>>> add this mb in smc.
> >>>>>>
> >>>>>> The error messages are as follows:
> >>>>>> [ 6.996395][ 7] [ T295] [drm:amdgpu_device_ip_late_init
> >>>>>> [amdgpu]] *ERROR*
> >>>>>> late_init of IP block <si_dpm> failed -22
> >>>>>> [ 7.006919][ 7] [ T295] amdgpu 0000:04:00.0:
> >
> > The issue is happening in late_init() which eventually does
> >
> > ret = si_thermal_enable_alert(adev, false);
> >
> > Just before this, si_thermal_start_thermal_controller is called in
> > hw_init and that enables thermal alert.
> >
> > Maybe the issue is with enable/disable of thermal alerts in quick
> > succession. Adding a delay inside si_thermal_start_thermal_controller
> > might help.
> >
>
> On a second look, temperature range is already set as part of
> si_thermal_start_thermal_controller in hw_init
> https://elixir.bootlin.com/linux/v6.1-rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L6780
>
> There is no need to set it again here -
>
> https://elixir.bootlin.com/linux/v6.1-rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L7635
>
> I think it is safe to remove the call from late_init altogether. Alex/Evan?
>
[Quan, Evan] Yes, it makes sense to me. But I'm not sure whether that is related to the issue here. As I understand it, if the issue were caused by enabling the thermal alert twice, it would fail every time. That cannot explain why adding a delay or an mb() call helps.

BR
Evan
> Thanks,
> Lijo
>
> > Thanks,
> > Lijo
> >
> >>>>>> amdgpu_device_ip_late_init failed
> >>>>>> [ 7.014224][ 7] [ T295] amdgpu 0000:04:00.0: Fatal error during GPU init
> >>>>> Memory barriers are not supposed to be sprinkled around like this; you
> >>>>> need to give a detailed explanation of why this is necessary.
> >>>>>
> >>>>> Regards,
> >>>>> Christian.
> >>>>>
> >>>>>> Signed-off-by: Zhenneng Li <lizhenneng@xxxxxxxxxx>
> >>>>>> ---
> >>>>>> drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
> >>>>>> 1 file changed, 2 insertions(+)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>>>>> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>>>>> index 8f994ffa9cd1..c7656f22278d 100644
> >>>>>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>>>>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>>>>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct
> >>>>>> amdgpu_device *adev)
> >>>>>>      u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
> >>>>>>      u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
> >>>>>> +    mb();
> >>>>>> +
> >>>>>>      if (!(rst & RST_REG) && !(clk & CK_DISABLE))
> >>>>>>          return true;
> >>>> In particular, it makes no sense in this specific place, since it
> >>>> cannot directly affect the values of rst & clk.
> >>>
> >>> I think so too.
> >>>
> >>> But when I run the reboot test on nine desktop machines, this error
> >>> may be reported on one or two machines after hundreds or thousands
> >>> of reboots. At first I used msleep() instead of mb(); both methods
> >>> work, but I don't know what the root cause is.
> >>>
> >>> When I use this method on another vendor's Oland card, this error
> >>> message is reported again.
> >>>
> >>> What could be the root cause?
> >>>
> >>> Test environment:
> >>>
> >>> graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87
> >>>
> >>> driver: amdgpu
> >>>
> >>> os: ubuntu 2004
> >>>
> >>> platform: arm64
> >>>
> >>> kernel: 5.4.18
> >>>
> >>>>
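
For reference, Michel's point about the barrier placement can be illustrated with a small userspace sketch. This is not driver code: `struct fake_smc` and both helper functions are hypothetical, the bit positions are placeholders, and the C11 `atomic_thread_fence()` is only a stand-in for the kernel's `mb()`. It shows that once both register values have been copied into locals, a barrier placed between the reads and the comparison cannot change the result of the comparison.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit positions, for illustration only. */
#define RST_REG    (1u << 0)
#define CK_DISABLE (1u << 0)

/* Hypothetical stand-in for the two SMC registers read by the patch. */
struct fake_smc {
    uint32_t reset_cntl;   /* plays the role of SMC_SYSCON_RESET_CNTL */
    uint32_t clock_cntl_0; /* plays the role of SMC_SYSCON_CLOCK_CNTL_0 */
};

/* Shape of the check before the patch. */
static bool smc_running_no_barrier(const struct fake_smc *smc)
{
    uint32_t rst = smc->reset_cntl;
    uint32_t clk = smc->clock_cntl_0;

    return !(rst & RST_REG) && !(clk & CK_DISABLE);
}

/* Shape of the check after the patch, with a fence standing in for mb(). */
static bool smc_running_with_barrier(const struct fake_smc *smc)
{
    uint32_t rst = smc->reset_cntl;
    uint32_t clk = smc->clock_cntl_0;

    /* Both values are already held in locals at this point. */
    atomic_thread_fence(memory_order_seq_cst);

    return !(rst & RST_REG) && !(clk & CK_DISABLE);
}
```

Both helpers return the same value for any register contents, which is consistent with the suspicion in the thread that the barrier only changes timing, as msleep() did, rather than fixing an ordering problem at this spot.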