Re: [PATCH] drm/amdgpu: reset asic after system-wide suspend aborted

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 11/18/2021 7:41 PM, Christian König wrote:
Am 18.11.21 um 15:09 schrieb Lazar, Lijo:
On 11/18/2021 7:36 PM, Alex Deucher wrote:
On Thu, Nov 18, 2021 at 8:11 AM Liang, Prike <Prike.Liang@xxxxxxx> wrote:

[Public]

-----Original Message-----
From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
Sent: Thursday, November 18, 2021 4:01 PM
To: Liang, Prike <Prike.Liang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Huang, Ray
<Ray.Huang@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: reset asic after system-wide suspend
aborted



On 11/18/2021 12:32 PM, Prike Liang wrote:
Do ASIC reset at the moment Sx suspend aborted behind of amdgpu
suspend to keep AMDGPU in a clean reset state and that can avoid
re-initialize device improperly error.

Signed-off-by: Prike Liang <Prike.Liang@xxxxxxx>
---
   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++++
   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 19
+++++++++++++++++++
   3 files changed, 24 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index b85b67a..8bd9833 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1053,6 +1053,7 @@ struct amdgpu_device {
     bool                            in_s3;
     bool                            in_s4;
     bool                            in_s0ix;
+   bool                            pm_completed;

PM framework maintains separate flags, why not use the same?

          dev->power.is_suspended = false;
          dev->power.is_noirq_suspended = false;
          dev->power.is_late_suspended = false;


Thanks for pointing it out and I miss that flag. In this case we can use the PM flag is_noirq_suspended for checking AMDGPU device whether is finished suspend.

BTW, Alex posted a patch which does similar thing, though it tries reset if
suspend fails.

https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fall%2FDM6PR12MB26195F8E099407B4B6966FEBE4999%40&amp;data=04%7C01%7Clijo.lazar%40amd.com%7C79c861fcfccc4f2cc45d08d9aa9d61f0%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637728415203834017%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=gGnctl8rlNTj6o02hlE06og3RCA%2BQP37B3ejsZSxPdM%3D&amp;reserved=0
DM6PR12MB2619.namprd12.prod.outlook.com/

If that reset also failed, then no point in another reset, or keep it as part of
resume.


Alex's patch seems always do the ASIC reset at the end of AMDGPU device, but that may needn't on the normal AMDGPU device suspend. However, this patch shows that can handle the system suspend aborted after AMDGPU suspend case well, so now seems we only need take care suspend abort case here.


Yeah, I was thinking we'd take this patch rather than mine to minimize
the number of resets.


Wondering if suspend fails whether there is a need to call resume. It may not get to resume() to do a reset, that needs to be checked.

I would rather do it so that we reset the ASIC during resume when we detect that the hardware is in a bad state.

This way it should also work with things like kexec and virtualization.

I was thinking from the power framework perspective. If the device's suspend() callback returns failure, why would the framework needs to call a resume() on that device.

Thanks,
Lijo


Regards,
Christian.


Thanks,
Lijo


Alex


Thanks,
Lijo


     atomic_t                        in_gpu_reset;
     enum pp_mp1_state               mp1_state;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index ec42a6f..a12ed54 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3983,6 +3983,10 @@ int amdgpu_device_resume(struct drm_device
*dev, bool fbcon)
     if (adev->in_s0ix)
             amdgpu_gfx_state_change_set(adev,
sGpuChangeState_D0Entry);

+   if (!adev->pm_completed) {
+           dev_warn(adev->dev, "suspend aborted will do asic reset\n");
+           amdgpu_asic_reset(adev);
+   }
     /* post card */
     if (amdgpu_device_need_post(adev)) {
             r = amdgpu_device_asic_init(adev); diff --git
a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index eee3cf8..9f90017 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2168,6 +2168,23 @@ static int amdgpu_pmops_suspend(struct
device *dev)
     return r;
   }

+/*
+ * Actually the PM suspend whether is completed should be confirmed
+ * by checking the sysfs
+sys/power/suspend_stats/failed_suspend.However,
+ * in this function only check the AMDGPU device whether is suspended
+ * completely in the system-wide suspend process.
+ */
+static int amdgpu_pmops_noirq_suspend(struct device *dev) {
+   struct drm_device *drm_dev = dev_get_drvdata(dev);
+   struct amdgpu_device *adev = drm_to_adev(drm_dev);
+
+   dev_dbg(dev, "amdgpu suspend completely.\n");
+   adev->pm_completed = true;
+
+   return 0;
+}
+
   static int amdgpu_pmops_resume(struct device *dev)
   {
     struct drm_device *drm_dev = dev_get_drvdata(dev); @@ -2181,6
+2198,7 @@ static int amdgpu_pmops_resume(struct device *dev)
     r = amdgpu_device_resume(drm_dev, true);
     if (amdgpu_acpi_is_s0ix_active(adev))
             adev->in_s0ix = false;
+   adev->pm_completed = false;
     return r;
   }

@@ -2397,6 +2415,7 @@ static const struct dev_pm_ops amdgpu_pm_ops
= {
     .runtime_suspend = amdgpu_pmops_runtime_suspend,
     .runtime_resume = amdgpu_pmops_runtime_resume,
     .runtime_idle = amdgpu_pmops_runtime_idle,
+   .suspend_noirq = amdgpu_pmops_noirq_suspend,
   };

   static int amdgpu_flush(struct file *f, fl_owner_t id)





[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux