RE: [PATCH] drm/amdgpu: disable cstate properly for driver reloading scenario

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[AMD Official Use Only - General]

> -----Original Message-----
> From: Quan, Evan <Evan.Quan@xxxxxxx>
> Sent: Thursday, March 2, 2023 4:31 AM
> To: Lazar, Lijo <Lijo.Lazar@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx;
> Deucher, Alexander <Alexander.Deucher@xxxxxxx>
> Subject: RE: [PATCH] drm/amdgpu: disable cstate properly for driver
> reloading scenario
> 
> [AMD Official Use Only - General]
> 
> 
> 
> > -----Original Message-----
> > From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
> > Sent: Thursday, March 2, 2023 5:21 PM
> > To: Quan, Evan <Evan.Quan@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>
> > Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for driver
> > reloading scenario
> >
> >
> >
> > On 3/2/2023 2:43 PM, Quan, Evan wrote:
> > > [AMD Official Use Only - General]
> > >
> > >
> > >
> > >> -----Original Message-----
> > >> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
> > >> Sent: Thursday, March 2, 2023 4:28 PM
> > >> To: Quan, Evan <Evan.Quan@xxxxxxx>; amd-
> gfx@xxxxxxxxxxxxxxxxxxxxx
> > >> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>
> > >> Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for driver
> > >> reloading scenario
> > >>
> > >>
> > >>
> > >> On 3/2/2023 12:28 PM, Evan Quan wrote:
> > >>> Gpu reset might be needed during driver reloading. To guard
> > >>> that(gpu
> > >>> reset) work, df cstate needs to be disabled properly.
> > >>>
> > >>> Signed-off-by: Evan Quan <evan.quan@xxxxxxx>
> > >>> Change-Id: I5c074c265c0b08a67b6934ae1ad9aa3fed245461
> > >>> ---
> > >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++++
> > >>>    1 file changed, 9 insertions(+)
> > >>>
> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > >>> index 51bbeaa1f311..3c854461ef32 100644
> > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > >>> @@ -2816,6 +2816,15 @@ static int
> > >>> amdgpu_device_ip_fini_early(struct
> > >> amdgpu_device *adev)
> > >>>    	amdgpu_device_set_pg_state(adev,
> AMD_PG_STATE_UNGATE);
> > >>>    	amdgpu_device_set_cg_state(adev,
> AMD_CG_STATE_UNGATE);
> > >>>
> > >>> +	/*
> > >>> +	 * Get df cstate disabled properly on driver unloading.
> > >>> +	 * Since on the succeeding driver reloading, gpu reset might
> > >>> +	 * be required. And cstate disabled is a prerequisite for
> > >>> +	 * that(gpu reset).
> > >>> +	 */
> > >>> +	if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_DISALLOW))
> > >>> +		dev_warn(adev->dev, "Failed to disallow df cstate");
> > >>> +
> > >>
> > >> This looks more like a firmware bug. Driver sends the Unload
> > >> message to
> > FW.
> > >> In that case FW should disable all features including C-state.
> > > Driver does not send the Unload message. We want PMFM alive and
> > > ready
> > for handling possible gpu reset on reloading.
> > >
> >
> > Actually, soc21_need_reset_on_init code itself has a bug. PSP won't
> > get unloaded by default on ring destruction. Even if PSP stops, it
> > could just keep the heartbeat value as non-zero (just that it won't
> increment).
> >
> > Probably, that needs to be fixed first rather than keeping PMFW alive
> > for a reset.
> As I remembered, the change(asic reset during reloading) seemed
> introduced to address some sriov issues.
> @Deucher, Alexander might share more backgrounds about this.
> To be honest, I'm not a fan of this(perform asic reset during reloading).

I'm open to doing it a better way.  We did it for two reasons:
1. often times the device was left in a weird state after the driver unload/VM killed. Etc.  We needed a way to put the device into a known good state so the driver could re-initialize it.  Plus, IIRC, on some of the older ASICS, once the SMU or PSP firmware was loaded, there was no way to reload it without a reset so you needed one anyway.  This is largely why we have to reset for S4 as well.
2. Some large servers didn't power off PCI devices on reboots to save time.  This left the devices with whatever state they had before the system was rebooted which led to driver initialization problems on subsequent boots because the device was in an unknown state.

If there is a better way to handle these situations, I'm all for it.

Alex


> 
> Evan
> >
> > Thanks,
> > Lijo
> >
> > > BR
> > > Evan
> > >>
> > >> Thanks,
> > >> Lijo
> > >>
> > >>>    	amdgpu_amdkfd_suspend(adev, false);
> > >>>
> > >>>    	/* Workaroud for ASICs need to disable SMC first */




[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux