Re: [PATCH 2/2] drm/amdgpu: move enable irq later to avoid race with ih resume

On 9/14/2022 3:40 PM, Victor Zhao wrote:
[background]
On the current sienna cichlid mode2 reset, in the slow job hang cases,
the page table context is reverted to completely stop the gpu, which
generates a page fault interrupt.

Since irqs are enabled during the recovery stage, this interrupt may
still be in processing during the ih resume step. Processing increases
the ih ring rptr while ih resume sets rptr and wptr to 0. This may cause

AFAIK, only GFX/SDMA are affected by mode-2. IH is not suspended before mode-2. Why do you resume IH after mode-2 when it is not suspended? Is it a special case for virtualization?

Thanks,
Lijo

rptr greater than wptr. This case is not handled in ih processing,
so rptr keeps increasing until it reaches the max.
This leads to the fence fallback situation.
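
The race described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the amdgpu IH code: the struct, the function names, and the 16-entry ring size are all hypothetical, and the "interrupt handler" is reduced to a loop that advances rptr until it equals wptr, wrapping at the ring size.

```c
#include <assert.h>

#define RING_ENTRIES 16u                 /* hypothetical size, power of two */
#define PTR_MASK     (RING_ENTRIES - 1u)

/* Hypothetical stand-in for an interrupt ring: only the two pointers. */
struct ih_model {
	unsigned int rptr;
	unsigned int wptr;
};

/* Drain the ring: advance rptr (with wrap) until it meets wptr.
 * Returns the number of entries the loop walked over. */
static unsigned int ih_model_process(struct ih_model *ih)
{
	unsigned int processed = 0;

	assert(ih->rptr <= PTR_MASK && ih->wptr <= PTR_MASK);
	while (ih->rptr != ih->wptr) {
		ih->rptr = (ih->rptr + 1u) & PTR_MASK;
		processed++;
	}
	return processed;
}

/* Model of the resume step: both pointers are reset to 0. */
static void ih_model_resume(struct ih_model *ih)
{
	ih->rptr = 0;
	ih->wptr = 0;
}

/* Racy order: resume zeroes the pointers while a handler is in flight,
 * and the handler then stores its already-advanced rptr. */
static unsigned int ih_model_racy(void)
{
	struct ih_model ih = { .rptr = 4, .wptr = 5 };

	ih_model_resume(&ih);         /* rptr = wptr = 0 */
	ih.rptr = 5;                  /* in-flight handler writes back rptr */
	return ih_model_process(&ih); /* rptr > wptr: walks to the wrap point */
}

/* Ordered case: the handler finishes before resume, and nothing is
 * processed again until the pointers have been reset. */
static unsigned int ih_model_ordered(void)
{
	struct ih_model ih = { .rptr = 4, .wptr = 5 };

	ih_model_process(&ih);        /* handler completes: rptr == wptr == 5 */
	ih_model_resume(&ih);         /* rptr = wptr = 0 */
	return ih_model_process(&ih); /* nothing stale to walk over */
}
```

In this model the racy order walks over 11 stale entries (from 5 up to the wrap at 16) before rptr meets wptr again, while the ordered case walks over none.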

[how]
Move the enabling of irq to after ih is resumed and before the ib test.
Adjust the position of the irq enable on the other reset paths accordingly.

Signed-off-by: Victor Zhao <Victor.Zhao@xxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  | 8 ++++----
  drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c | 1 +
  2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c0cfae52f12b..0b658225e9ef 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4625,8 +4625,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
  		amdgpu_fence_driver_force_completion(ring);
  	}
-	amdgpu_fence_driver_isr_toggle(adev, false);
-
  	if (job && job->vm)
  		drm_sched_increase_karma(&job->base);
@@ -4758,6 +4756,10 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
  		test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
  	skip_hw_reset = test_bit(AMDGPU_SKIP_HW_RESET, &reset_context->flags);
+	list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
+		amdgpu_fence_driver_isr_toggle(tmp_adev, false);
+	}
+
  	/*
  	 * ASIC reset has to be done on all XGMI hive nodes ASAP
  	 * to allow proper links negotiation in FW (within 1 sec)
@@ -5031,8 +5033,6 @@ static void amdgpu_device_recheck_guilty_jobs(
  			/* Clear this failed job from fence array */
  			amdgpu_fence_driver_clear_job_fences(ring);
-			amdgpu_fence_driver_isr_toggle(adev, false);
-
  			/* Since the job won't signal and we go for
  			 * another resubmit drop this parent pointer
  			 */
diff --git a/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c b/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
index 7aa570c1ce4a..953036482d1f 100644
--- a/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
+++ b/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
@@ -240,6 +240,7 @@ sienna_cichlid_mode2_restore_hwcontext(struct amdgpu_reset_control *reset_ctl,
  	* Add this ASIC as tracked as reset was already
  	* complete successfully.
  	*/
+	amdgpu_fence_driver_isr_toggle(tmp_adev, false);
  	amdgpu_register_gpu_instance(tmp_adev);
  	/* Resume RAS */