Re: [PATCH 2/2] drm/amdgpu: move enable irq later to avoid race with ih resume

"Lazar, Lijo" <lijo.lazar@xxxxxxx> · Thu, 15 Sep 2022 13:29:30 +0530

On 9/15/2022 12:08 PM, Zhao, Victor wrote:
[AMD Official Use Only - General]

Hi Lijo,

IH resume was added to resolve an issue found during mode2 bring up on sienna cichlid:
- close down mode2 reset and do a mode1 reset first
- open mode2 reset and do a mode2 reset. Mode2 reset was found fail in this case.

Resume IH helps in this case


Sorry, what do you mean by 'close down' mode2 /'open mode2 reset'? Do 
you mean if mode-1 reset is done first, a subsequent mode-2 reset 
doesn't work without IH resume?

Thanks,
Lijo


Thanks,
Victor



-----Original Message-----
From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
Sent: Thursday, September 15, 2022 1:58 PM
To: Zhao, Victor <Victor.Zhao@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deng, Emily <Emily.Deng@xxxxxxx>; Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
Subject: Re: [PATCH 2/2] drm/amdgpu: move enable irq later to avoid race with ih resume



On 9/14/2022 3:40 PM, Victor Zhao wrote:
[background]
On current sienna cichlid mode2 reset, on the slow job hang cases,
since page table context was reverted to completely stop gpu, it will
generate page fault interrupt.

Since the irq are open during recovery stage, during ih resume step,
if this interrupt was in processing, which increased ih ring rptr, and
ih resume meanwhile will set rptr and wptr to 0. This may cause

AFAIK, only GFX/SDMA are affected by mode-2. IH is not suspended before mode-2. Why do you resume IH after mode-2 when it is not suspended? Is it a special case for virtualization?

Thanks,
Lijo

rptr greater than wptr. Such case was not handled in ih process, and
it will cause rptr continue increasing util reaches the max.
Such case will make fence fallback situation happen.

[how]
Move the enable of irq after ih resumed and before ib test.
Adjusting the position of enable irq on other reset paths accordingly.

Signed-off-by: Victor Zhao <Victor.Zhao@xxxxxxx>
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  | 8 ++++----
   drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c | 1 +
   2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c0cfae52f12b..0b658225e9ef 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4625,8 +4625,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
   		amdgpu_fence_driver_force_completion(ring);
   	}
   
-	amdgpu_fence_driver_isr_toggle(adev, false);
-
   	if (job && job->vm)
   		drm_sched_increase_karma(&job->base);
   
@@ -4758,6 +4756,10 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
   		test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
   	skip_hw_reset = test_bit(AMDGPU_SKIP_HW_RESET,
&reset_context->flags);
   
+	list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
+		amdgpu_fence_driver_isr_toggle(tmp_adev, false);
+	}
+
   	/*
   	 * ASIC reset has to be done on all XGMI hive nodes ASAP
   	 * to allow proper links negotiation in FW (within 1 sec) @@
-5031,8 +5033,6 @@ static void amdgpu_device_recheck_guilty_jobs(
   			/* Clear this failed job from fence array */
   			amdgpu_fence_driver_clear_job_fences(ring);
   
-			amdgpu_fence_driver_isr_toggle(adev, false);
-
   			/* Since the job won't signal and we go for
   			 * another resubmit drop this parent pointer
   			 */
diff --git a/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
b/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
index 7aa570c1ce4a..953036482d1f 100644
--- a/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
+++ b/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
@@ -240,6 +240,7 @@ sienna_cichlid_mode2_restore_hwcontext(struct amdgpu_reset_control *reset_ctl,
   	* Add this ASIC as tracked as reset was already
   	* complete successfully.
   	*/
+	amdgpu_fence_driver_isr_toggle(tmp_adev, false);
   	amdgpu_register_gpu_instance(tmp_adev);
   
   	/* Resume RAS */