Re: [PATCH] drm/amdgpu: reset asic after system-wide suspend aborted

"Lazar, Lijo" <lijo.lazar@xxxxxxx> · Thu, 18 Nov 2021 19:58:22 +0530

On 11/18/2021 7:55 PM, Alex Deucher wrote:
On Thu, Nov 18, 2021 at 9:15 AM Lazar, Lijo <lijo.lazar@xxxxxxx> wrote:



On 11/18/2021 7:41 PM, Christian König wrote:
Am 18.11.21 um 15:09 schrieb Lazar, Lijo:
On 11/18/2021 7:36 PM, Alex Deucher wrote:
On Thu, Nov 18, 2021 at 8:11 AM Liang, Prike <Prike.Liang@xxxxxxx>
wrote:

[Public]

-----Original Message-----
From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
Sent: Thursday, November 18, 2021 4:01 PM
To: Liang, Prike <Prike.Liang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Huang, Ray
<Ray.Huang@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: reset asic after system-wide suspend
aborted



On 11/18/2021 12:32 PM, Prike Liang wrote:
Do ASIC reset at the moment Sx suspend aborted behind of amdgpu
suspend to keep AMDGPU in a clean reset state and that can avoid
re-initialize device improperly error.

Signed-off-by: Prike Liang <Prike.Liang@xxxxxxx>
---
    drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++++
    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 19
+++++++++++++++++++
    3 files changed, 24 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index b85b67a..8bd9833 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1053,6 +1053,7 @@ struct amdgpu_device {
      bool                            in_s3;
      bool                            in_s4;
      bool                            in_s0ix;
+   bool                            pm_completed;

PM framework maintains separate flags, why not use the same?

           dev->power.is_suspended = false;
           dev->power.is_noirq_suspended = false;
           dev->power.is_late_suspended = false;


Thanks for pointing it out and I miss that flag. In this case we can
use the PM flag is_noirq_suspended for checking AMDGPU device
whether is finished suspend.

BTW, Alex posted a patch which does similar thing, though it tries
reset if
suspend fails.

https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fall%2FDM6PR12MB26195F8E099407B4B6966FEBE4999%40&amp;data=04%7C01%7Clijo.lazar%40amd.com%7C2ce211aeeeb448f6cb2c08d9aa9f4741%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637728423343483218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=nyzhGwTJV83YZkit34Bb%2B5tBxGEMvFzCyZPjz8eSDgc%3D&amp;reserved=0

DM6PR12MB2619.namprd12.prod.outlook.com/

If that reset also failed, then no point in another reset, or keep
it as part of
resume.


Alex's patch seems always do the ASIC reset at the end of AMDGPU
device, but that may needn't on the normal AMDGPU device suspend.
However, this patch shows that can handle the system suspend aborted
after AMDGPU suspend case well, so now seems we only need take care
suspend abort case here.


Yeah, I was thinking we'd take this patch rather than mine to minimize
the number of resets.


Wondering if suspend fails whether there is a need to call resume. It
may not get to resume() to do a reset, that needs to be checked.

I would rather do it so that we reset the ASIC during resume when we
detect that the hardware is in a bad state.

This way it should also work with things like kexec and virtualization.

I was thinking from the power framework perspective. If the device's
suspend() callback returns failure, why would the framework needs to
call a resume() on that device.

The device's suspend callback succeeds, but some other device or event
in the system causes the overall suspend to abort.  As such the GPU is
never powered off by the platform so it's left in a partially
initialized state.  The system then calls the resume() callbacks for
all of the devices that were previously suspended to bring them back
to a working system.  GPU reset is just a convenient way to get the
device back into a known good state.  Unfortunately, I'm not sure if
there is a good way to detect whether the GPU is in a known good state
or not until we try and re-init the IPs and at that point the IPs are
not fully initialized so it's complex to try and unwind that, reset,
and try again.  It's probably easiest to just reset the GPU on resume
all the time.  If the GPU was powered down, the reset should work fine
since we are resetting a GPU that just came out of reset.  If the GPU
was not powered down, the reset will put the GPU into a known good
state.


https://elixir.bootlin.com/linux/latest/source/drivers/base/power/main.c#L925

I don't have a machine to trace the path. The flag is set only if 
suspend is succesful.

Thanks,
Lijo

Alex


Thanks,
Lijo


Regards,
Christian.


Thanks,
Lijo


Alex


Thanks,
Lijo


      atomic_t                        in_gpu_reset;
      enum pp_mp1_state               mp1_state;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index ec42a6f..a12ed54 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3983,6 +3983,10 @@ int amdgpu_device_resume(struct drm_device
*dev, bool fbcon)
      if (adev->in_s0ix)
              amdgpu_gfx_state_change_set(adev,
sGpuChangeState_D0Entry);

+   if (!adev->pm_completed) {
+           dev_warn(adev->dev, "suspend aborted will do asic
reset\n");
+           amdgpu_asic_reset(adev);
+   }
      /* post card */
      if (amdgpu_device_need_post(adev)) {
              r = amdgpu_device_asic_init(adev); diff --git
a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index eee3cf8..9f90017 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2168,6 +2168,23 @@ static int amdgpu_pmops_suspend(struct
device *dev)
      return r;
    }

+/*
+ * Actually the PM suspend whether is completed should be confirmed
+ * by checking the sysfs
+sys/power/suspend_stats/failed_suspend.However,
+ * in this function only check the AMDGPU device whether is
suspended
+ * completely in the system-wide suspend process.
+ */
+static int amdgpu_pmops_noirq_suspend(struct device *dev) {
+   struct drm_device *drm_dev = dev_get_drvdata(dev);
+   struct amdgpu_device *adev = drm_to_adev(drm_dev);
+
+   dev_dbg(dev, "amdgpu suspend completely.\n");
+   adev->pm_completed = true;
+
+   return 0;
+}
+
    static int amdgpu_pmops_resume(struct device *dev)
    {
      struct drm_device *drm_dev = dev_get_drvdata(dev); @@ -2181,6
+2198,7 @@ static int amdgpu_pmops_resume(struct device *dev)
      r = amdgpu_device_resume(drm_dev, true);
      if (amdgpu_acpi_is_s0ix_active(adev))
              adev->in_s0ix = false;
+   adev->pm_completed = false;
      return r;
    }

@@ -2397,6 +2415,7 @@ static const struct dev_pm_ops amdgpu_pm_ops
= {
      .runtime_suspend = amdgpu_pmops_runtime_suspend,
      .runtime_resume = amdgpu_pmops_runtime_resume,
      .runtime_idle = amdgpu_pmops_runtime_idle,
+   .suspend_noirq = amdgpu_pmops_noirq_suspend,
    };

    static int amdgpu_flush(struct file *f, fl_owner_t id)