[AMD Official Use Only - General] Hi Lijo, Yes, this is what I observed in sienna cichlid. Thanks, Victor -----Original Message----- From: Lazar, Lijo <Lijo.Lazar@xxxxxxx> Sent: Thursday, September 15, 2022 4:00 PM To: Zhao, Victor <Victor.Zhao@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx Cc: Deng, Emily <Emily.Deng@xxxxxxx>; Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx> Subject: Re: [PATCH 2/2] drm/amdgpu: move enable irq later to avoid race with ih resume On 9/15/2022 12:08 PM, Zhao, Victor wrote: > [AMD Official Use Only - General] > > Hi Lijo, > > IH resume was added to resolve an issue found during mode2 bring up on sienna cichlid: > - close down mode2 reset and do a mode1 reset first > - open mode2 reset and do a mode2 reset. Mode2 reset was found fail in this case. > > Resume IH helps in this case > Sorry, what do you mean by 'close down' mode2 /'open mode2 reset'? Do you mean if mode-1 reset is done first, a subsequent mode-2 reset doesn't work without IH resume? Thanks, Lijo > > Thanks, > Victor > > > > -----Original Message----- > From: Lazar, Lijo <Lijo.Lazar@xxxxxxx> > Sent: Thursday, September 15, 2022 1:58 PM > To: Zhao, Victor <Victor.Zhao@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > Cc: Deng, Emily <Emily.Deng@xxxxxxx>; Grodzovsky, Andrey > <Andrey.Grodzovsky@xxxxxxx> > Subject: Re: [PATCH 2/2] drm/amdgpu: move enable irq later to avoid > race with ih resume > > > > On 9/14/2022 3:40 PM, Victor Zhao wrote: >> [background] >> On current sienna cichlid mode2 reset, on the slow job hang cases, >> since page table context was reverted to completely stop gpu, it will >> generate page fault interrupt. >> >> Since the irq are open during recovery stage, during ih resume step, >> if this interrupt was in processing, which increased ih ring rptr, >> and ih resume meanwhile will set rptr and wptr to 0. This may cause > > AFAIK, only GFX/SDMA are affected by mode-2. IH is not suspended before mode-2. Why do you resume IH after mode-2 when it is not suspended? Is it a special case for virtualization? > > Thanks, > Lijo > >> rptr greater than wptr. Such case was not handled in ih process, and >> it will cause rptr continue increasing util reaches the max. >> Such case will make fence fallback situation happen. >> >> [how] >> Move the enable of irq after ih resumed and before ib test. >> Adjusting the position of enable irq on other reset paths accordingly. >> >> Signed-off-by: Victor Zhao <Victor.Zhao@xxxxxxx> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++---- >> drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c | 1 + >> 2 files changed, 5 insertions(+), 4 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> index c0cfae52f12b..0b658225e9ef 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> @@ -4625,8 +4625,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev, >> amdgpu_fence_driver_force_completion(ring); >> } >> >> - amdgpu_fence_driver_isr_toggle(adev, false); >> - >> if (job && job->vm) >> drm_sched_increase_karma(&job->base); >> >> @@ -4758,6 +4756,10 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle, >> test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags); >> skip_hw_reset = test_bit(AMDGPU_SKIP_HW_RESET, >> &reset_context->flags); >> >> + list_for_each_entry (tmp_adev, device_list_handle, reset_list) { >> + amdgpu_fence_driver_isr_toggle(tmp_adev, false); >> + } >> + >> /* >> * ASIC reset has to be done on all XGMI hive nodes ASAP >> * to allow proper links negotiation in FW (within 1 sec) @@ >> -5031,8 +5033,6 @@ static void amdgpu_device_recheck_guilty_jobs( >> /* Clear this failed job from fence array */ >> amdgpu_fence_driver_clear_job_fences(ring); >> >> - amdgpu_fence_driver_isr_toggle(adev, false); >> - >> /* Since the job won't signal and we go for >> * another resubmit drop this parent pointer >> */ >> diff --git a/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c >> b/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c >> index 7aa570c1ce4a..953036482d1f 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c >> +++ b/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c >> @@ -240,6 +240,7 @@ sienna_cichlid_mode2_restore_hwcontext(struct amdgpu_reset_control *reset_ctl, >> * Add this ASIC as tracked as reset was already >> * complete successfully. >> */ >> + amdgpu_fence_driver_isr_toggle(tmp_adev, false); >> amdgpu_register_gpu_instance(tmp_adev); >> >> /* Resume RAS */ >>