On 1/20/2025 23:45, Lazar, Lijo wrote:
On 1/13/2025 9:10 AM, Jiang Liu wrote:
When GPU suspend is aborted, do the same for dGPU as APU to reset
soc15 asic. Otherwise it may cause following errors:
[ 547.229463] amdgpu 0001:81:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)
[ 555.126827] amdgpu 0000:0a:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)
[ 555.126901] [drm:amdgpu_gfx_enable_kcq [amdgpu]] *ERROR* KCQ enable failed
[ 555.126957] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v9_4_3> failed -110
[ 555.126959] amdgpu 0000:0a:00.0: amdgpu: amdgpu_device_ip_resume failed (-110).
[ 555.126965] PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -110
[ 555.126966] PM: Device 0000:0a:00.0 failed to resume async: error -110
This fix has been tested on Mi308X.
Signed-off-by: Jiang Liu <gerry@xxxxxxxxxxxxxxxxx>
Tested-by: Shuo Liu <shuox.liu@xxxxxxxxxxxxxxxxx>
---
drivers/gpu/drm/amd/amdgpu/soc15.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index a59b4c36cad7..0e1daefd1a8e 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -605,12 +605,10 @@ soc15_asic_reset_method(struct amdgpu_device *adev)
static bool soc15_need_reset_on_resume(struct amdgpu_device *adev)
{
/* Will reset for the following suspend abort cases.
- * 1) Only reset on APU side, dGPU hasn't checked yet.
- * 2) S3 suspend aborted in the normal S3 suspend or
- * performing pm core test.
+ * 1) S3 suspend aborted in the normal S3 suspend
+ * 2) S3 suspend aborted in performing pm core test.
*/
- if (adev->flags & AMD_IS_APU && adev->in_s3 &&
- !pm_resume_via_firmware())
+ if (adev->in_s3 && !pm_resume_via_firmware())
return true;
I don't think this can be applied to all environments. For ex: this may
not be applicable for dGPUs combined with ARM CPUs.
I looked through amdgpu_acpi cases and I'm not sure I agree with this.
On ARM side amdgpu_acpi_is_s3_active() should never return true because
ARM doesn't support PM_SUSPEND_MEM.
This means that amdgpu_choose_low_power_state() shouldn't set is_s3 either.
So I don't think this block will run on ARM side.
That being said, we might have other more fundamental issues to worry
about with suspend/resume than handling aborted suspend/resume when run
on non-x86 so the whole set of amdgpu suspend/resume code might need to
be revisited.