Re: [PATCH 2/2] drm/amd/amdgpu: use the default reset for ras recovery

[AMD Official Use Only - General]


Hi @Deucher, Alexander and @Koenig, Christian

 

Could you help review this patch?

Without this patch, when a customer sets the `reset_method=3` module parameter to select mode2 reset, RAS recovery also uses mode2 reset and skips mode1 reset.

When an ECC error occurs, the GPU cannot be recovered with mode2 reset, and because mode1 reset is skipped, the GPU reset fails.

 

With this patch, RAS recovery (ECC error) always uses mode1 reset, even when `reset_method=3` is set.
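
For reviewers who want the idea without reading the hunk in context, here is a minimal userspace sketch of the save/override/restore pattern the diff applies around the recovery call. It is an illustration only, not the driver code: `reset_method` and `gpu_recovery` are plain stand-ins for `amdgpu_reset_method` and `amdgpu_gpu_recovery`, while `fake_gpu_recover()` and `method_seen_by_recover` are hypothetical helpers that exist only to make the effect observable. The value -1 is taken from the diff, where it selects the default reset.

```c
/* Stand-ins for the module parameters discussed in this thread.
 * -1 selects the default reset (as in the diff), 3 selects mode2. */
enum { RESET_METHOD_DEFAULT = -1, RESET_METHOD_MODE2 = 3 };

static int reset_method = RESET_METHOD_MODE2; /* as if reset_method=3 was set */
static int gpu_recovery = 2;                  /* the condition the diff checks */

/* Hypothetical probe: the reset method that was in effect during recovery. */
static int method_seen_by_recover;

/* Hypothetical stand-in for amdgpu_device_gpu_recover(); it only records
 * which reset method it would have used. */
static void fake_gpu_recover(void)
{
	method_seen_by_recover = reset_method;
}

/* Save the user's setting, force the default for this one recovery,
 * then restore the user's setting afterwards. */
static void ras_do_recovery(void)
{
	int saved_reset_method = reset_method;

	if (gpu_recovery == 2)
		reset_method = RESET_METHOD_DEFAULT;

	fake_gpu_recover();

	if (gpu_recovery == 2)
		reset_method = saved_reset_method;
}
```

After one call to `ras_do_recovery()`, the recovery path has seen the default method while the user's `reset_method=3` setting is back in place for later, non-RAS resets. One point that may be worth discussing in review: because the override goes through a global, any other reset that runs concurrently would also observe the temporary value.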

 

Thanks

Sam

 

From: Feng, Kenneth <Kenneth.Feng@xxxxxxx>
Date: Monday, April 29, 2024 at 16:15
To: Feng, Kenneth <Kenneth.Feng@xxxxxxx>, amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>, Zhang, GuoQing (Sam) <GuoQing.Zhang@xxxxxxx>
Cc: Zhang, Owen(SRDC) <Owen.Zhang2@xxxxxxx>, Aldabagh, Maad <Maad.Aldabagh@xxxxxxx>, Ma, Qing (Mark) <Qing.Ma@xxxxxxx>
Subject: RE: [PATCH 2/2] drm/amd/amdgpu: use the default reset for ras recovery

+@Zhang, GuoQing (Sam)

-----Original Message-----
From: Kenneth Feng <kenneth.feng@xxxxxxx>
Sent: Monday, April 29, 2024 3:32 PM
To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Zhang, Owen(SRDC) <Owen.Zhang2@xxxxxxx>; Aldabagh, Maad <Maad.Aldabagh@xxxxxxx>; Ma, Qing (Mark) <Qing.Ma@xxxxxxx>; Feng, Kenneth <Kenneth.Feng@xxxxxxx>
Subject: [PATCH 2/2] drm/amd/amdgpu: use the default reset for ras recovery

use the default reset for ras recovery

Signed-off-by: Kenneth Feng <kenneth.feng@xxxxxxx>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index a037e8fba29f..f92b2c4f0d5c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2437,6 +2437,7 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
        struct amdgpu_device *adev = ras->adev;
        struct list_head device_list, *device_list_handle =  NULL;
        struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev);
+       int save_reset_method = amdgpu_reset_method;

        if (hive) {
                atomic_set(&hive->ras_recovery, 1);
@@ -2501,7 +2502,13 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
                        }
                }

+               if (amdgpu_gpu_recovery == 2)
+                       amdgpu_reset_method = -1;
+
                amdgpu_device_gpu_recover(ras->adev, NULL, &reset_context);
+
+               if (amdgpu_gpu_recovery == 2)
+                       amdgpu_reset_method = save_reset_method;
        }
        atomic_set(&ras->in_recovery, 0);
        if (hive) {
--
2.34.1

