[AMD Official Use Only - General]

Hi @Koenig, Christian,

About the change "drm/amdgpu: sanitize fence numbers", do you know if this VCE issue still exists? Can we change the fence processing back?

Hi @Grodzovsky, Andrey,

So it looks like disabling the interrupt is possibly the most appropriate fix for this issue for now? Please help review this part.

Thanks,
Victor

-----Original Message-----
From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
Sent: Monday, September 19, 2022 11:04 PM
To: Zhao, Victor <Victor.Zhao@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deng, Emily <Emily.Deng@xxxxxxx>
Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

I don't know if the issue still exists, but it's worth checking with Christian, who wrote that patch.

Andrey

On 2022-09-16 23:31, Zhao, Victor wrote:
> [AMD Official Use Only - General]
>
> Hi Andrey,
>
> Yes, moving the irq disable can fix the issue. The change in amdgpu_fence_process is only meant to make sure the driver can correct itself from an overflow situation. I didn't know about the previous issue there.
>
> Do you know if that issue still exists? Or was it on VCE only?
>
>
> Thanks,
> Victor
>
>
>
> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
> Sent: Friday, September 16, 2022 9:50 PM
> To: Koenig, Christian <Christian.Koenig@xxxxxxx>; Zhao, Victor
> <Victor.Zhao@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Deng, Emily <Emily.Deng@xxxxxxx>
> Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>
>
> On 2022-09-16 01:18, Christian König wrote:
>> On 15.09.22 at 22:37, Andrey Grodzovsky wrote:
>>> On 2022-09-15 15:26, Christian König wrote:
>>>> On 15.09.22 at 20:29, Andrey Grodzovsky wrote:
>>>>> On 2022-09-15 06:09, Zhao, Victor wrote:
>>>>>> [AMD Official Use Only - General]
>>>>>>
>>>>>> Hi Christian,
>>>>>>
>>>>>> The test sequence is triggering a compute engine hang while
>>>>>> running a lot of containers submitting gfx jobs. We have advanced
>>>>>> TDR mode and mode2 reset enabled in the driver.
>>>>>> When a compute job hang timeout happens, the 2 jobs on the gfx
>>>>>> pending list may be signaled after drm_sched_stop. So they will
>>>>>> not be removed from the pending list, but will have
>>>>>> DMA_FENCE_FLAG_SIGNALED_BIT set.
>>>>>> At the amdgpu_device_recheck_guilty_jobs step, the first job will
>>>>>> be rerun and removed from the pending list.
>>>>>> At the resubmit step, the second job (with the signaled bit set)
>>>>>> will be resubmitted. Since it still has the signaled bit,
>>>>>> drm_sched_job_done will be called directly. This decreases
>>>>>> hw_rq_count, which allows more jobs to be emitted, but does not
>>>>>> clear the fence_drv RCU pointer.
>>>>>> This results in an overflow in the fence_drv. Since we use
>>>>>> num_fences_mask in amdgpu_fence_process, when the overflow
>>>>>> happens, the signaling of some jobs is skipped, which results in
>>>>>> an infinite wait for the fence_drv RCU pointer.
>>>>>>
>>>>>> So disabling the irq before drm_sched_stop avoids jobs signaling
>>>>>> after drm_sched_stop. And signaling jobs one by one in
>>>>>> amdgpu_fence_process instead of using the mask handles the
>>>>>> overflow situation.
>>>>>>
>>>>>> Another fix could be to skip submitting jobs which have already
>>>>>> signaled during the resubmit stage, which may look cleaner.
>>>>>>
>>>>>> Please help give some advice.
>>>>>
>>>>> How about the code below instead? The real problem is that we
>>>>> reuse a dma_fence twice, which is not in line with the dma_fence
>>>>> design, so maybe this can help?
>>>>>
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> index 8adeb7469f1e..033f0ae16784 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> @@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, struct amd
>>>>>  	if (job && job->job_run_counter) {
>>>>>  		/* reinit seq for resubmitted jobs */
>>>>>  		fence->seqno = seq;
>>>>> +
>>>>> +		/* For resubmitted jobs clear the signaled bit */
>>>>> +		clear_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT, &fence->flags);
>>>>> +
>>>> Upstream will pretty much kill you for that.
>>>>
>>>> Re-setting a fence from a signaled to an unsignaled state is a
>>>> massive no-go.
>>>>
>>>> Christian.
>>>
>>> Is it worse than doing fence->seqno = seq;? That is already a huge
>>> hack, no?
>> No, it's equally bad. I don't think we can do either.
>>
>> Christian.
>
> And all those ugly hacks are there because we reuse a dma_fence (the
> hw_fence embedded in the job), and correct me if I'm wrong, but I don't
> think a dma_fence is ever supposed to be reused.
>
> So maybe, as Victor suggested, we should move the irq disable and flush
> before drm_sched_stop - in my opinion this should solve the issue. But
> Victor - why do you then still need the change in amdgpu_fence_process?
> You will not have the overflow situation, because by moving irq_disable
> before the stop, any job that signaled will be removed from the
> scheduler's pending list anyway. Also note that this change reverts
> "drm/amdgpu: sanitize fence numbers" and could reintroduce that bug.
>
> Andrey
>
>
>>> Andrey
>>>
>>>
>>>>>  		/* TO be inline with external fence creation and other drivers */
>>>>>  		dma_fence_get(fence);
>>>>>  	} else {
>>>>>
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Victor
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>>>>> Sent: Thursday, September 15, 2022 2:32 PM
>>>>>> To: Zhao, Victor <Victor.Zhao@xxxxxxx>;
>>>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Grodzovsky, Andrey
>>>>>> <Andrey.Grodzovsky@xxxxxxx>
>>>>>> Cc: Deng, Emily <Emily.Deng@xxxxxxx>
>>>>>> Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by
>>>>>> overflow
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 15.09.22 at 06:02, Zhao, Victor wrote:
>>>>>>> [AMD Official Use Only - General]
>>>>>>>
>>>>>>> Ping.
>>>>>>>
>>>>>>> Hi @Koenig, Christian and @Grodzovsky, Andrey,
>>>>>>>
>>>>>>> We found some reset-related issues during stress testing on this
>>>>>>> sequence. Please help give some comments.
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Victor
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Victor Zhao <Victor.Zhao@xxxxxxx>
>>>>>>> Sent: Wednesday, September 14, 2022 6:10 PM
>>>>>>> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>> Cc: Deng, Emily <Emily.Deng@xxxxxxx>; Grodzovsky, Andrey
>>>>>>> <Andrey.Grodzovsky@xxxxxxx>; Zhao, Victor <Victor.Zhao@xxxxxxx>
>>>>>>> Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>>>>>>
>>>>>>> [background]
>>>>>>> For a GPU recovery caused by a hang on one ring (e.g. compute),
>>>>>>> jobs from another ring (e.g. gfx) may continue signaling during
>>>>>>> the drm_sched_stop stage. The signaled bit will not be cleared.
>>>>>>>
>>>>>>> At the resubmit stage after recovery, a job with the hw fence
>>>>>>> signaled bit set will call job done directly instead of going
>>>>>>> through fence processing.
>>>>>>> This makes hw_rq_count decrease but leaves the RCU fence pointer
>>>>>>> uncleared.
>>>>>>>
>>>>>>> Then an overflow happens in the fence driver slots and some jobs
>>>>>>> may be skipped, leaving their RCU pointers uncleared, which causes
>>>>>>> an infinite wait for the slot on the next fence emitted.
>>>>>>>
>>>>>>> This infinite wait causes a job timeout on the emitting job, and
>>>>>>> the driver gets stuck at its sched stop step because kthread_park
>>>>>>> cannot be done.
>>>>>>>
>>>>>>> [how]
>>>>>>> 1. move amdgpu_fence_driver_isr_toggle earlier to disable the
>>>>>>> interrupt before drm_sched_stop
>>>>>>> 2. handle all fences in fence processing to avoid skipping any
>>>>>>> when an overflow happens
>>>>>>>
>>>>>>> Signed-off-by: Victor Zhao <Victor.Zhao@xxxxxxx>
>>>>>>> ---
>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++---
>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  6 +-----
>>>>>>>  2 files changed, 14 insertions(+), 8 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index 943c9e750575..c0cfae52f12b 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>>>>>>  		amdgpu_virt_fini_data_exchange(adev);
>>>>>>>  	}
>>>>>>>  
>>>>>>> -	amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>>> -
>>>>>>>  	/* block all schedulers and reset given job's ring */
>>>>>>>  	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>>>  		struct amdgpu_ring *ring = adev->rings[i];
>>>>>>> @@ -5214,6 +5212,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>  		    amdgpu_device_ip_need_full_reset(tmp_adev))
>>>>>>>  			amdgpu_ras_suspend(tmp_adev);
>>>>>>>  
>>>>>>> +		amdgpu_fence_driver_isr_toggle(tmp_adev, true);
>>>>>>> +
>>>>>>>  		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>>>  			struct amdgpu_ring *ring = tmp_adev->rings[i];
>>>>>>> @@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>  		atomic_inc(&tmp_adev->gpu_reset_counter);
>>>>>>>  	}
>>>>>>>  
>>>>>>> -	if (need_emergency_restart)
>>>>>>> +	if (need_emergency_restart) {
>>>>>>> +		list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
>>>>>>> +			amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>>>>> +		}
>>>>>>>  		goto skip_sched_resume;
>>>>>>> +	}
>>>>>>>  
>>>>>>>  	/*
>>>>>>>  	 * Must check guilty signal here since after this point all old
>>>>>>> @@ -5240,6 +5244,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>  	if (job && dma_fence_is_signaled(&job->hw_fence)) {
>>>>>>>  		job_signaled = true;
>>>>>>>  		dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
>>>>>>> +		list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
>>>>>>> +			amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>>>>> +		}
>>>>>>>  		goto skip_hw_reset;
>>>>>>>  	}
>>>>>>>  
>>>>>>> @@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>  			if (r && r == -EAGAIN) {
>>>>>>>  				set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context->flags);
>>>>>>>  				adev->asic_reset_res = 0;
>>>>>>> +				amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>>>  				goto retry;
>>>>>>>  			}
>>>>>>>  		}
>>>>>>> @@ -5711,6 +5719,8 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>>>>>>  	set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
>>>>>>>  	set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
>>>>>>>  
>>>>>>> +	amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>>> +
>>>>>>>  	adev->no_hw_access = true;
>>>>>>>  	r = amdgpu_device_pre_asic_reset(adev, &reset_context);
>>>>>>>  	adev->no_hw_access = false;
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> index 8adeb7469f1e..65a877e1a7fc 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> @@ -287,15 +287,11 @@ bool amdgpu_fence_process(struct amdgpu_ring *ring)
>>>>>>>  	if (unlikely(seq == last_seq))
>>>>>>>  		return false;
>>>>>>>  
>>>>>>> -	last_seq &= drv->num_fences_mask;
>>>>>>> -	seq &= drv->num_fences_mask;
>>>>>>> -
>>>>>>>  	do {
>>>>>>>  		struct dma_fence *fence, **ptr;
>>>>>>>  
>>>>>>>  		++last_seq;
>>>>>>> -		last_seq &= drv->num_fences_mask;
>>>>>>> -		ptr = &drv->fences[last_seq];
>>>>>>> +		ptr = &drv->fences[last_seq & drv->num_fences_mask];
>>>>>>>  
>>>>>>>  		/* There is always exactly one thread signaling this fence slot */
>>>>>>>  		fence = rcu_dereference_protected(*ptr, 1);
>>>>>> These changes here don't seem to make sense. Please explain further
>>>>>> why they are necessary.
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>>> --
>>>>>>> 2.25.1
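
To make the overflow Victor describes concrete, here is a minimal user-space sketch (illustrative only, not driver code; NUM_FENCES, walk_masked and walk_full are made-up stand-ins) contrasting the masked walk that amdgpu_fence_process currently does with the one-by-one walk the patch proposes:

#include <stdint.h>
#include <stdio.h>

#define NUM_FENCES      8u              /* stand-in for the ring's power-of-two slot count */
#define NUM_FENCES_MASK (NUM_FENCES - 1u)

/* Masked walk, modeling the current amdgpu_fence_process(): both sequence
 * numbers are reduced to slot indices up front, so at most NUM_FENCES slots
 * can be visited no matter how many fences actually completed. */
static unsigned int walk_masked(uint32_t last_seq, uint32_t seq)
{
	unsigned int visited = 0;

	last_seq &= NUM_FENCES_MASK;
	seq &= NUM_FENCES_MASK;
	do {
		++last_seq;
		last_seq &= NUM_FENCES_MASK;
		++visited; /* stands in for signaling drv->fences[last_seq] */
	} while (last_seq != seq);
	return visited;
}

/* One-by-one walk, modeling the patch: the counters stay full sequence
 * numbers and the mask is applied only when indexing the slot array, so
 * every completed fence is visited even after an overflow. */
static unsigned int walk_full(uint32_t last_seq, uint32_t seq)
{
	unsigned int visited = 0;

	do {
		++last_seq;
		++visited; /* stands in for drv->fences[last_seq & NUM_FENCES_MASK] */
	} while (last_seq != seq);
	return visited;
}

int main(void)
{
	/* 10 fences completed between two runs of fence processing, but the
	 * ring only has 8 slots: an overflow in the fence driver. */
	uint32_t last_seq = 4, seq = 14;

	printf("masked walk signals %u of %u fences\n",
	       walk_masked(last_seq, seq), seq - last_seq);
	printf("full walk signals %u of %u fences\n",
	       walk_full(last_seq, seq), seq - last_seq);
	return 0;
}

With 8 slots and 10 completed fences, the masked walk visits only 10 mod 8 = 2 slots and leaves the other RCU pointers set, while the full walk visits all 10. Applying num_fences_mask only at the array index is what lets the patched loop tolerate the overflow.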