[AMD Official Use Only - General] Hi Andrey, Thank you for your clarifying. The refcount modified by amdgpu_fence_emit on my side is different. I update my code and get the patch 'drm/amdgpu: Follow up change to previous drm scheduler change.' , the " underflow " disappears. My patch is not needed. Thanks, Jiadong -----Original Message----- From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx> Sent: Friday, July 15, 2022 11:43 PM To: Zhu, Jiadong <Jiadong.Zhu@xxxxxxx>; Christian König <ckoenig.leichtzumerken@xxxxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx Cc: Huang, Ray <Ray.Huang@xxxxxxx>; Liu, Aaron <Aaron.Liu@xxxxxxx> Subject: Re: [PATCH 3/3] drm/amdgpu: skip put fence if signal fails On 2022-07-15 05:28, Zhu, Jiadong wrote: > [AMD Official Use Only - General] > > Updated some comments > > -----Original Message----- > From: Zhu, Jiadong > Sent: Friday, July 15, 2022 5:13 PM > To: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>; > amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Grodzovsky, Andrey > <Andrey.Grodzovsky@xxxxxxx> > Cc: Huang, Ray <Ray.Huang@xxxxxxx>; Liu, Aaron <Aaron.Liu@xxxxxxx> > Subject: RE: [PATCH 3/3] drm/amdgpu: skip put fence if signal fails > > Hi Christian, > > The resubmitted job in function amdgpu_ib_preempt_job_recovery returns the same hw fence because of this commit: > > static void amdgpu_ib_preempt_job_recovery(struct drm_gpu_scheduler *sched) { > struct drm_sched_job *s_job; > struct dma_fence *fence; > > spin_lock(&sched->job_list_lock); > list_for_each_entry(s_job, &sched->pending_list, list) { > fence = sched->ops->run_job(s_job); //fence returned has the same address with swapped fences > dma_fence_put(fence); > } > spin_unlock(&sched->job_list_lock); > } > > > > commit c530b02f39850a639b72d01ebbf7e5d745c60831 > Author: Jack Zhang <Jack.Zhang1@xxxxxxx> > Date: Wed May 12 15:06:35 2021 +0800 > > drm/amd/amdgpu embed hw_fence into amdgpu_job > > Why: Previously hw fence is alloced separately with job. > It caused historical lifetime issues and corner cases. > The ideal situation is to take fence to manage both job > and fence's lifetime, and simplify the design of gpu-scheduler. > > How: > We propose to embed hw_fence into amdgpu_job. > 1. We cover the normal job submission by this method. > 2. For ib_test, and submit without a parent job keep the > legacy way to create a hw fence separately. > v2: > use AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT to show that the fence is > embedded in a job. > v3: > remove redundant variable ring in amdgpu_job > v4: > add tdr sequence support for this feature. Add a job_run_counter to > indicate whether this job is a resubmit job. > v5 > add missing handling in amdgpu_fence_enable_signaling > > Signed-off-by: Jingwen Chen <Jingwen.Chen2@xxxxxxx> > Signed-off-by: Jack Zhang <Jack.Zhang7@xxxxxxxxxxx> > Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> > Reviewed by: Monk Liu <monk.liu@xxxxxxx> > Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx> > > > Thus the fence we swapped out is signaled and put twice in the following 2 functions and we get " refcount_t: underflow; use-after-free. " errors latter. > > /* wait for jobs finished */ > amdgpu_fence_wait_empty(ring); //wait on the resubmitted fence which is signaled and put somewhere else. The refcount decreased by 1 after amdgpu_fence_wait_empty. > > /* signal the old fences */ > amdgpu_ib_preempt_signal_fences(fences, length); //signal and put the previous swapped fence, signal would return -22. > > Thanks, > Jiadong Did you have 'drm/amdgpu: Follow up change to previous drm scheduler change.' this commit in your branch while you encountered this problem ? I don't see an underflow issue for the preempted job when inspecting the code with this commit in mind - amdgpu_fence_emit dma_fence_init 1 dma_fence_get(fence) 2 rcu_assign_pointer(*ptr, dma_fence_get(fence) 3 drm_sched_main s_fence->parent = dma_fence_get(fence); 4 dma_fence_put(fence); 3 amdgpu_ib_preempt_job_recovery amdgpu_fence_emit if (job && job->job_run_counter) -> dma_fence_get(fence); 4 rcu_assign_pointer(*ptr, dma_fence_get(fence)); 5 dma_fence_put(fence); 4 amdgpu_fence_wait_empty dma_fence_get_rcu(fence) 5 dma_fence_put(fence) 4 amdgpu_process_fence (EOP interrupt for re-submission of preempted job) dma_fence_put 3 amdgpu_ib_preempt_signal_fences dma_fence_put 2 amdgpu_job_free_cb dma_fence_put(&job->hw_fence) 1 drm_sched_fence_release_scheduled dma_fence_put(fence->parent); 0 Also take a look here for reference - https://drive.google.com/file/d/1yEoeW6OQC9WnwmzFW6NBLhFP_jD0xcHm/view Andrey Andrey > > > -----Original Message----- > From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx> > Sent: Friday, July 15, 2022 4:48 PM > To: Zhu, Jiadong <Jiadong.Zhu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; > Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx> > Cc: Huang, Ray <Ray.Huang@xxxxxxx>; Liu, Aaron <Aaron.Liu@xxxxxxx> > Subject: Re: [PATCH 3/3] drm/amdgpu: skip put fence if signal fails > > [CAUTION: External Email] > > Am 15.07.22 um 10:43 schrieb jiadong.zhu@xxxxxxx: >> From: "Jiadong.Zhu" <Jiadong.Zhu@xxxxxxx> >> >> Dma_fence_signal returning non-zero indicates that the fence is >> signaled and put somewhere else. >> Skip dma_fence_put to make the fence refcount correct. > Well quite a big NAK on this. > > Reference counting should be completely independent where a fence signals. > > Andrey can you take a look at this as well? > > Thanks, > Christian. > >> Signed-off-by: Jiadong.Zhu <Jiadong.Zhu@xxxxxxx> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++-- >> 1 file changed, 2 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >> index f4ed0785d523..93c1a5e83835 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c >> @@ -1500,8 +1500,8 @@ static void amdgpu_ib_preempt_signal_fences(struct dma_fence **fences, >> fence = fences[i]; >> if (!fence) >> continue; >> - dma_fence_signal(fence); >> - dma_fence_put(fence); >> + if (!dma_fence_signal(fence)) >> + dma_fence_put(fence); >> } >> } >>