[PATCH] drm/amdgpu: fix check order ras->in_recovery is earlier than ras feature

Bob Zhou <bob.zhou@xxxxxxx> · Fri, 27 Oct 2023 18:02:44 +0800

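amdgpu_ras_reset_error_count() reads ras->in_recovery before it checks
whether the RAS feature is supported. On devices without RAS support the
RAS context pointer (ras) can be NULL, so the read causes the NULL
pointer dereference below. Move the RAS support check ahead of the
in_recovery check to fix it.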

BUG: kernel NULL pointer dereference, address: 00000000000000e8
RIP: 0010:amdgpu_ras_reset_error_count+0xf6/0x190 [amdgpu]
Call Trace:
 <TASK>
 ? show_regs+0x72/0x90
 ? __die+0x25/0x80
 ? page_fault_oops+0x79/0x190
 ? do_user_addr_fault+0x30c/0x640
 ? __wake_up_klogd.part.0+0x40/0x70
 ? exc_page_fault+0x81/0x1b0
 ? asm_exc_page_fault+0x27/0x30
 ? amdgpu_ras_reset_error_count+0xf6/0x190 [amdgpu]
 ? __pfx_gmc_v9_0_late_init+0x10/0x10 [amdgpu]
 gmc_v9_0_late_init+0x97/0xe0 [amdgpu]

Fixes: be5c7eb10406 ("drm/amdgpu: bypass RAS error reset in some conditions")
Signed-off-by: Bob Zhou <bob.zhou@xxxxxxx>
---
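Note: the ordering issue described in the commit message can be
illustrated outside the driver with a minimal sketch. All names here
(struct ras_ctx, struct dev, ras_is_supported, reset_error_count) are
hypothetical stand-ins, not the real amdgpu types or functions, and the
sketch assumes that "RAS not supported" corresponds to a NULL context
pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the RAS context; not the real struct amdgpu_ras. */
struct ras_ctx {
	int in_recovery;
};

/* Hypothetical stand-in for the device; ras is assumed NULL without RAS support. */
struct dev {
	struct ras_ctx *ras;
};

/* Models the support check: a NULL context means "not supported". */
static bool ras_is_supported(const struct dev *d)
{
	return d->ras != NULL;
}

/* Check order after this patch: support first, then state inside the context. */
static int reset_error_count(const struct dev *d)
{
	if (!ras_is_supported(d))
		return -EOPNOTSUPP;

	/* d->ras is known to be non-NULL here, so reading it is safe. */
	if (d->ras->in_recovery)
		return -EOPNOTSUPP;

	return 0;
}
```

With the two checks swapped, the first call on a device whose ras pointer
is NULL would dereference it, which is the pattern the call trace above
points at.
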
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 303fbb6a48b6..3af50754800d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1229,15 +1229,15 @@ int amdgpu_ras_reset_error_count(struct amdgpu_device *adev,
 		return -EOPNOTSUPP;
 	}
 
+	if (!amdgpu_ras_is_supported(adev, block) ||
+	    !amdgpu_ras_get_mca_debug_mode(adev))
+		return -EOPNOTSUPP;
+
 	/* skip ras error reset in gpu reset */
 	if ((amdgpu_in_reset(adev) || atomic_read(&ras->in_recovery)) &&
 	    mca_funcs && mca_funcs->mca_set_debug_mode)
 		return -EOPNOTSUPP;
 
-	if (!amdgpu_ras_is_supported(adev, block) ||
-	    !amdgpu_ras_get_mca_debug_mode(adev))
-		return -EOPNOTSUPP;
-
 	if (block_obj->hw_ops->reset_ras_error_count)
 		block_obj->hw_ops->reset_ras_error_count(adev);
 
-- 
2.34.1