ras->in_recovery is checked before RAS feature support is verified,
which causes the NULL pointer dereference below. Reorder the checks so
that RAS support is verified first.

BUG: kernel NULL pointer dereference, address: 00000000000000e8
RIP: 0010:amdgpu_ras_reset_error_count+0xf6/0x190 [amdgpu]
Call Trace:
 <TASK>
 ? show_regs+0x72/0x90
 ? __die+0x25/0x80
 ? page_fault_oops+0x79/0x190
 ? do_user_addr_fault+0x30c/0x640
 ? __wake_up_klogd.part.0+0x40/0x70
 ? exc_page_fault+0x81/0x1b0
 ? asm_exc_page_fault+0x27/0x30
 ? amdgpu_ras_reset_error_count+0xf6/0x190 [amdgpu]
 ? __pfx_gmc_v9_0_late_init+0x10/0x10 [amdgpu]
 gmc_v9_0_late_init+0x97/0xe0 [amdgpu]

Fixes: be5c7eb10406 ("drm/amdgpu: bypass RAS error reset in some conditions")
Signed-off-by: Bob Zhou <bob.zhou@xxxxxxx>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 303fbb6a48b6..3af50754800d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1229,15 +1229,15 @@ int amdgpu_ras_reset_error_count(struct amdgpu_device *adev,
 		return -EOPNOTSUPP;
 	}
 
+	if (!amdgpu_ras_is_supported(adev, block) ||
+	    !amdgpu_ras_get_mca_debug_mode(adev))
+		return -EOPNOTSUPP;
+
 	/* skip ras error reset in gpu reset */
 	if ((amdgpu_in_reset(adev) || atomic_read(&ras->in_recovery)) &&
 	    mca_funcs && mca_funcs->mca_set_debug_mode)
 		return -EOPNOTSUPP;
 
-	if (!amdgpu_ras_is_supported(adev, block) ||
-	    !amdgpu_ras_get_mca_debug_mode(adev))
-		return -EOPNOTSUPP;
-
 	if (block_obj->hw_ops->reset_ras_error_count)
 		block_obj->hw_ops->reset_ras_error_count(adev);
 
-- 
2.34.1
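
For reviewers who want to see the ordering issue in isolation, here is a
minimal user-space sketch of the guard order this patch establishes. It is
not driver code: the toy_* types and functions are hypothetical stand-ins,
and it assumes (the patch does not state this explicitly) that the RAS
context pointer is NULL whenever the feature is unsupported, which would be
consistent with the small fault address in the trace. The MCA debug mode
and mca_funcs conditions are left out for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver structures; not the real amdgpu types. */
struct toy_ras {
	int in_recovery;
};

struct toy_dev {
	struct toy_ras *ras;	/* assumed NULL when the RAS feature is not set up */
	bool in_reset;
};

/* Hypothetical support check: RAS counts as supported only if a context exists. */
static bool toy_ras_is_supported(const struct toy_dev *dev)
{
	return dev->ras != NULL;
}

/* Patched ordering: verify support first, then read fields of the context. */
static int toy_reset_error_count(const struct toy_dev *dev)
{
	assert(dev != NULL);

	if (!toy_ras_is_supported(dev))
		return -EOPNOTSUPP;

	/* Safe: dev->ras is known to be non-NULL at this point. */
	if (dev->in_reset || dev->ras->in_recovery)
		return -EOPNOTSUPP;

	return 0;
}
```

With the two checks swapped, a device whose ras pointer is NULL would reach
the dev->ras->in_recovery read first and fault; with the order above it
returns -EOPNOTSUPP before the pointer is ever dereferenced.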