[AMD Official Use Only - General] > -----Original Message----- > From: Lazar, Lijo <Lijo.Lazar@xxxxxxx> > Sent: Monday, March 6, 2023 12:42 AM > To: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Quan, Evan > <Evan.Quan@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for driver > reloading scenario > > > > On 3/2/2023 9:05 PM, Lazar, Lijo wrote: > > > > > > On 3/2/2023 8:56 PM, Deucher, Alexander wrote: > >> [AMD Official Use Only - General] > >> > >>> -----Original Message----- > >>> From: Quan, Evan <Evan.Quan@xxxxxxx> > >>> Sent: Thursday, March 2, 2023 4:31 AM > >>> To: Lazar, Lijo <Lijo.Lazar@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; > >>> Deucher, Alexander <Alexander.Deucher@xxxxxxx> > >>> Subject: RE: [PATCH] drm/amdgpu: disable cstate properly for driver > >>> reloading scenario > >>> > >>> [AMD Official Use Only - General] > >>> > >>> > >>> > >>>> -----Original Message----- > >>>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx> > >>>> Sent: Thursday, March 2, 2023 5:21 PM > >>>> To: Quan, Evan <Evan.Quan@xxxxxxx>; amd- > gfx@xxxxxxxxxxxxxxxxxxxxx > >>>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx> > >>>> Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for driver > >>>> reloading scenario > >>>> > >>>> > >>>> > >>>> On 3/2/2023 2:43 PM, Quan, Evan wrote: > >>>>> [AMD Official Use Only - General] > >>>>> > >>>>> > >>>>> > >>>>>> -----Original Message----- > >>>>>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx> > >>>>>> Sent: Thursday, March 2, 2023 4:28 PM > >>>>>> To: Quan, Evan <Evan.Quan@xxxxxxx>; amd- > >>> gfx@xxxxxxxxxxxxxxxxxxxxx > >>>>>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx> > >>>>>> Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for > >>>>>> driver reloading scenario > >>>>>> > >>>>>> > >>>>>> > >>>>>> On 3/2/2023 12:28 PM, Evan Quan wrote: > >>>>>>> Gpu reset might be needed during driver reloading. To guard > >>>>>>> that(gpu > >>>>>>> reset) work, df cstate needs to be disabled properly. > >>>>>>> > >>>>>>> Signed-off-by: Evan Quan <evan.quan@xxxxxxx> > >>>>>>> Change-Id: I5c074c265c0b08a67b6934ae1ad9aa3fed245461 > >>>>>>> --- > >>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 > +++++++++ > >>>>>>> 1 file changed, 9 insertions(+) > >>>>>>> > >>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >>>>>>> index 51bbeaa1f311..3c854461ef32 100644 > >>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >>>>>>> @@ -2816,6 +2816,15 @@ static int > >>>>>>> amdgpu_device_ip_fini_early(struct > >>>>>> amdgpu_device *adev) > >>>>>>> amdgpu_device_set_pg_state(adev, > >>> AMD_PG_STATE_UNGATE); > >>>>>>> amdgpu_device_set_cg_state(adev, > >>> AMD_CG_STATE_UNGATE); > >>>>>>> > >>>>>>> + /* > >>>>>>> + * Get df cstate disabled properly on driver unloading. > >>>>>>> + * Since on the succeeding driver reloading, gpu reset > >>>>>>> +might > >>>>>>> + * be required. And cstate disabled is a prerequisite for > >>>>>>> + * that(gpu reset). > >>>>>>> + */ > >>>>>>> + if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_DISALLOW)) > >>>>>>> + dev_warn(adev->dev, "Failed to disallow df cstate"); > >>>>>>> + > >>>>>> > >>>>>> This looks more like a firmware bug. Driver sends the Unload > >>>>>> message to > >>>> FW. > >>>>>> In that case FW should disable all features including C-state. > >>>>> Driver does not send the Unload message. We want PMFM alive and > >>>>> ready > >>>> for handling possible gpu reset on reloading. > >>>>> > >>>> > >>>> Actually, soc21_need_reset_on_init code itself has a bug. PSP won't > >>>> get unloaded by default on ring destruction. Even if PSP stops, it > >>>> could just keep the heartbeat value as non-zero (just that it won't > >>> increment). > >>>> > >>>> Probably, that needs to be fixed first rather than keeping PMFW > >>>> alive for a reset. > >>> As I remembered, the change(asic reset during reloading) seemed > >>> introduced to address some sriov issues. > >>> @Deucher, Alexander might share more backgrounds about this. > >>> To be honest, I'm not a fan of this(perform asic reset during > >>> reloading). > >> > >> I'm open to doing it a better way. We did it for two reasons: > >> 1. often times the device was left in a weird state after the driver > >> unload/VM killed. Etc. We needed a way to put the device into a > >> known good state so the driver could re-initialize it. Plus, IIRC, > >> on some of the older ASICS, once the SMU or PSP firmware was loaded, > >> there was no way to reload it without a reset so you needed one > >> anyway. This is largely why we have to reset for S4 as well. > >> 2. Some large servers didn't power off PCI devices on reboots to save > >> time. This left the devices with whatever state they had before the > >> system was rebooted which led to driver initialization problems on > >> subsequent boots because the device was in an unknown state. > >> > >> If there is a better way to handle these situations, I'm all for it. > > Hi Alex, > > There is a part of FW running to handle generic reset requests on SOC21 > SOCs. Can we remove the reset-on-reload for soc21 SOCs, unload PMFW as > usual and see how it goes? Sure. Alex > > Thanks, > Lijo > > >> > > > > Are those cases valid still? We have this for GFX9 - reset done only > > for pass through. And some of the GFX9 ASICs are used in large servers. > > > > /* Just return false for soc15 GPUs. Reset does not seem to > > * be necessary. > > */ > > if (!amdgpu_passthrough(adev)) > > return false; > > > > Thanks, > > Lijo > > > >> Alex > >> > >> > >>> > >>> Evan > >>>> > >>>> Thanks, > >>>> Lijo > >>>> > >>>>> BR > >>>>> Evan > >>>>>> > >>>>>> Thanks, > >>>>>> Lijo > >>>>>> > >>>>>>> amdgpu_amdkfd_suspend(adev, false); > >>>>>>> > >>>>>>> /* Workaroud for ASICs need to disable SMC first */