On Thu, Nov 18, 2021 at 9:15 AM Lazar, Lijo <lijo.lazar@xxxxxxx> wrote: > > > > On 11/18/2021 7:41 PM, Christian König wrote: > > Am 18.11.21 um 15:09 schrieb Lazar, Lijo: > >> On 11/18/2021 7:36 PM, Alex Deucher wrote: > >>> On Thu, Nov 18, 2021 at 8:11 AM Liang, Prike <Prike.Liang@xxxxxxx> > >>> wrote: > >>>> > >>>> [Public] > >>>> > >>>>> -----Original Message----- > >>>>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx> > >>>>> Sent: Thursday, November 18, 2021 4:01 PM > >>>>> To: Liang, Prike <Prike.Liang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > >>>>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Huang, Ray > >>>>> <Ray.Huang@xxxxxxx> > >>>>> Subject: Re: [PATCH] drm/amdgpu: reset asic after system-wide suspend > >>>>> aborted > >>>>> > >>>>> > >>>>> > >>>>> On 11/18/2021 12:32 PM, Prike Liang wrote: > >>>>>> Do ASIC reset at the moment Sx suspend aborted behind of amdgpu > >>>>>> suspend to keep AMDGPU in a clean reset state and that can avoid > >>>>>> re-initialize device improperly error. > >>>>>> > >>>>>> Signed-off-by: Prike Liang <Prike.Liang@xxxxxxx> > >>>>>> --- > >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + > >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++ > >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 19 > >>>>> +++++++++++++++++++ > >>>>>> 3 files changed, 24 insertions(+) > >>>>>> > >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > >>>>>> index b85b67a..8bd9833 100644 > >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > >>>>>> @@ -1053,6 +1053,7 @@ struct amdgpu_device { > >>>>>> bool in_s3; > >>>>>> bool in_s4; > >>>>>> bool in_s0ix; > >>>>>> + bool pm_completed; > >>>>> > >>>>> PM framework maintains separate flags, why not use the same? > >>>>> > >>>>> dev->power.is_suspended = false; > >>>>> dev->power.is_noirq_suspended = false; > >>>>> dev->power.is_late_suspended = false; > >>>>> > >>>> > >>>> Thanks for pointing it out and I miss that flag. In this case we can > >>>> use the PM flag is_noirq_suspended for checking AMDGPU device > >>>> whether is finished suspend. > >>>> > >>>>> BTW, Alex posted a patch which does similar thing, though it tries > >>>>> reset if > >>>>> suspend fails. > >>>>> > >>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fall%2FDM6PR12MB26195F8E099407B4B6966FEBE4999%40&data=04%7C01%7Clijo.lazar%40amd.com%7C79c861fcfccc4f2cc45d08d9aa9d61f0%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637728415203834017%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=gGnctl8rlNTj6o02hlE06og3RCA%2BQP37B3ejsZSxPdM%3D&reserved=0 > >>>>> > >>>>> DM6PR12MB2619.namprd12.prod.outlook.com/ > >>>>> > >>>>> If that reset also failed, then no point in another reset, or keep > >>>>> it as part of > >>>>> resume. > >>>>> > >>>> > >>>> Alex's patch seems always do the ASIC reset at the end of AMDGPU > >>>> device, but that may needn't on the normal AMDGPU device suspend. > >>>> However, this patch shows that can handle the system suspend aborted > >>>> after AMDGPU suspend case well, so now seems we only need take care > >>>> suspend abort case here. > >>>> > >>> > >>> Yeah, I was thinking we'd take this patch rather than mine to minimize > >>> the number of resets. > >>> > >> > >> Wondering if suspend fails whether there is a need to call resume. It > >> may not get to resume() to do a reset, that needs to be checked. > > > > I would rather do it so that we reset the ASIC during resume when we > > detect that the hardware is in a bad state. > > > > This way it should also work with things like kexec and virtualization. > > I was thinking from the power framework perspective. If the device's > suspend() callback returns failure, why would the framework needs to > call a resume() on that device. The device's suspend callback succeeds, but some other device or event in the system causes the overall suspend to abort. As such the GPU is never powered off by the platform so it's left in a partially initialized state. The system then calls the resume() callbacks for all of the devices that were previously suspended to bring them back to a working system. GPU reset is just a convenient way to get the device back into a known good state. Unfortunately, I'm not sure if there is a good way to detect whether the GPU is in a known good state or not until we try and re-init the IPs and at that point the IPs are not fully initialized so it's complex to try and unwind that, reset, and try again. It's probably easiest to just reset the GPU on resume all the time. If the GPU was powered down, the reset should work fine since we are resetting a GPU that just came out of reset. If the GPU was not powered down, the reset will put the GPU into a known good state. Alex > > Thanks, > Lijo > > > > > Regards, > > Christian. > > > >> > >> Thanks, > >> Lijo > >> > >> > >>> Alex > >>> > >>> > >>>>> Thanks, > >>>>> Lijo > >>>>> > >>>>>> > >>>>>> atomic_t in_gpu_reset; > >>>>>> enum pp_mp1_state mp1_state; > >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >>>>>> index ec42a6f..a12ed54 100644 > >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >>>>>> @@ -3983,6 +3983,10 @@ int amdgpu_device_resume(struct drm_device > >>>>> *dev, bool fbcon) > >>>>>> if (adev->in_s0ix) > >>>>>> amdgpu_gfx_state_change_set(adev, > >>>>> sGpuChangeState_D0Entry); > >>>>>> > >>>>>> + if (!adev->pm_completed) { > >>>>>> + dev_warn(adev->dev, "suspend aborted will do asic > >>>>>> reset\n"); > >>>>>> + amdgpu_asic_reset(adev); > >>>>>> + } > >>>>>> /* post card */ > >>>>>> if (amdgpu_device_need_post(adev)) { > >>>>>> r = amdgpu_device_asic_init(adev); diff --git > >>>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > >>>>>> index eee3cf8..9f90017 100644 > >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > >>>>>> @@ -2168,6 +2168,23 @@ static int amdgpu_pmops_suspend(struct > >>>>> device *dev) > >>>>>> return r; > >>>>>> } > >>>>>> > >>>>>> +/* > >>>>>> + * Actually the PM suspend whether is completed should be confirmed > >>>>>> + * by checking the sysfs > >>>>>> +sys/power/suspend_stats/failed_suspend.However, > >>>>>> + * in this function only check the AMDGPU device whether is > >>>>>> suspended > >>>>>> + * completely in the system-wide suspend process. > >>>>>> + */ > >>>>>> +static int amdgpu_pmops_noirq_suspend(struct device *dev) { > >>>>>> + struct drm_device *drm_dev = dev_get_drvdata(dev); > >>>>>> + struct amdgpu_device *adev = drm_to_adev(drm_dev); > >>>>>> + > >>>>>> + dev_dbg(dev, "amdgpu suspend completely.\n"); > >>>>>> + adev->pm_completed = true; > >>>>>> + > >>>>>> + return 0; > >>>>>> +} > >>>>>> + > >>>>>> static int amdgpu_pmops_resume(struct device *dev) > >>>>>> { > >>>>>> struct drm_device *drm_dev = dev_get_drvdata(dev); @@ -2181,6 > >>>>>> +2198,7 @@ static int amdgpu_pmops_resume(struct device *dev) > >>>>>> r = amdgpu_device_resume(drm_dev, true); > >>>>>> if (amdgpu_acpi_is_s0ix_active(adev)) > >>>>>> adev->in_s0ix = false; > >>>>>> + adev->pm_completed = false; > >>>>>> return r; > >>>>>> } > >>>>>> > >>>>>> @@ -2397,6 +2415,7 @@ static const struct dev_pm_ops amdgpu_pm_ops > >>>>> = { > >>>>>> .runtime_suspend = amdgpu_pmops_runtime_suspend, > >>>>>> .runtime_resume = amdgpu_pmops_runtime_resume, > >>>>>> .runtime_idle = amdgpu_pmops_runtime_idle, > >>>>>> + .suspend_noirq = amdgpu_pmops_noirq_suspend, > >>>>>> }; > >>>>>> > >>>>>> static int amdgpu_flush(struct file *f, fl_owner_t id) > >>>>>> > >