Good point as well. How about the attached version? This time we keep an extra reference in amd_sched_process_job() until we are sure that we don't need the s_fence any more. Regards, Christian. Am 13.10.2017 um 11:13 schrieb Liu, Monk: > > your patch looks good, do you think we should also do this: > > > void amd_sched_fence_scheduled(struct amd_sched_fence *fence) > { > - int ret = fence_signal(&fence->scheduled); > + int ret; > + > + fence_get(&fence->scheduled;) > + ret = fence_signal(&fence->scheduled); > > if (!ret) > FENCE_TRACE(&fence->scheduled, "signaled from irq > context\n"); > else > FENCE_TRACE(&fence->scheduled, "was already signaled\n"); > + fence_put(&fence->scheduled); > } > > > ------------------------------------------------------------------------ > *From:* Koenig, Christian > *Sent:* Friday, October 13, 2017 5:00:27 PM > *To:* Liu, Monk; Nicolai Hähnle; amd-gfx at lists.freedesktop.org > *Subject:* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2 >> >> There is chance that free_job() called before that >> â??trace_amd_sched_process_jobâ??, correct? >> > Correct, but that is harmless. > > Take a look what trace_amd_sched_process_job actually does, it just > prints the pointer of the fence structure (but the pointer might be > stale at this point). > > Nevertheless you are right that this is really ugly. > > How about the attached patch to fix the issue? > > Regards, > Christian. > > Am 13.10.2017 um 10:51 schrieb Liu, Monk: >> >> The free_job() is called in sched_job_finish() which is queued on a >> WORK and scheduled from that â??amd_sched_fence_finished()â?? >> >> So the finishing timing of free_job() is asynchronized with >> sched_process_job() >> >> There is chance that free_job() called before that >> â??trace_amd_sched_process_jobâ??, correct? >> >> And if so the s_fence referred by it maybe a wild pointer >> >> BR Monk >> >> *From:*Liu, Monk >> *Sent:* 2017å¹´10æ??13æ?¥16:49 >> *To:* Koenig, Christian <Christian.Koenig at amd.com>; Nicolai Hähnle >> <nhaehnle at gmail.com>; amd-gfx at lists.freedesktop.org >> *Subject:* RE: regression with d6c650c0a8f6f671e49553725e1db541376d95f2 >> >> No thatâ??s not true >> >> The free_job() is called in sched_job_finish() which is queued on a >> WORK and scheduled from that â??amd_sched_fence_finished()â?? >> >> So the finishing timing of free_job() is asynchronized with >> sched_process_job() >> >> How can you sure free_job() must before â??trace_amd_sched_process_jobâ??? >> >> *From:*Koenig, Christian >> *Sent:* 2017å¹´10æ??13æ?¥16:43 >> *To:* Liu, Monk <Monk.Liu at amd.com <mailto:Monk.Liu at amd.com>>; Nicolai >> Hähnle <nhaehnle at gmail.com <mailto:nhaehnle at gmail.com>>; >> amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org> >> *Subject:* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2 >> >> The free_job() callback is called only way after the job has finished. >> >> That is one change actually made by you to the code :) >> >> Christian. >> >> Am 13.10.2017 um 10:39 schrieb Liu, Monk: >> >> I doubt it would always work fineâ?¦ >> >> First, we have FENCE_TRACE reference s_fence->finished after >> â??fence_signal(&fence->finished)â?? >> >> Second, we have trace_amd_sched_proess_job(s_fence) after >> â??amd_sched_fence_finished()â??, >> >> If you put the finished before free_job() and by coincidence the >> job_finish() get very soon executed youâ??ll have odds to hit wild >> pointer on above two cases >> >> BR Monk >> >> *From:*Koenig, Christian >> *Sent:* 2017å¹´10æ??13æ?¥16:17 >> *To:* Liu, Monk <Monk.Liu at amd.com> <mailto:Monk.Liu at amd.com>; >> Nicolai Hähnle <nhaehnle at gmail.com> <mailto:nhaehnle at gmail.com>; >> amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org> >> *Subject:* Re: regression with >> d6c650c0a8f6f671e49553725e1db541376d95f2 >> >> Yeah, that change is actually incorrect and should be reverted. >> >> What we really need to do is remove dropping sched_job->s_fence >> from amd_sched_process_job() into amd_sched_job_finish() directly >> before the call to free_job(). >> >> Regards, >> Christian. >> >> Am 13.10.2017 um 09:24 schrieb Liu, Monk: >> >> commit d6c650c0a8f6f671e49553725e1db541376d95f2 >> Author: Nicolai Hähnle <nicolai.haehnle at amd.com> >> <mailto:nicolai.haehnle at amd.com> >> @@ -611,6 +611,10 @@ static int amd_sched_main(void *param) >> >> fence = sched->ops->run_job(sched_job); >> amd_sched_fence_scheduled(s_fence); >> + >> + /* amd_sched_process_job drops the job's >> reference of the fence. */ >> + sched_job->s_fence = NULL; >> + >> if (fence) { >> s_fence->parent = dma_fence_get(fence); >> r = dma_fence_add_callback(fence, >> &s_fence->cb, >> >> Hi Nicolai >> >> with this patch, you will break >> "amdgpu_sched_hw_job_reset()"routine: >> >> void >> >> amd_sched_hw_job_reset(structamd_gpu_scheduler >> >> *sched) >> >> { >> >> structamd_sched_job >> >> *s_job; >> >> spin_lock(&sched->job_list_lock); >> >> list_for_each_entry_reverse(s_job, >> >> &sched->ring_mirror_list, node) { >> >> if(s_job->s_fence->parent >> >> && >> >> fence_remove_callback(s_job->s_fence->parent, >> >> &s_job->s_fence->cb)) >> >> { >> >> fence_put(s_job->s_fence->parent); >> >> s_job->s_fence->parent >> >> = >> >> NULL; >> >> atomic_dec(&sched->hw_rq_count); >> >> } >> >> } >> >> spin_unlock(&sched->job_list_lock); >> >> } >> >> see that without sched_job->s_fence, you cannot remove the >> callback from its hw fence, >> >> any idea?? >> >> BR Monk >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20171013/7f14204b/attachment-0001.html> -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-drm-amd-sched-fix-job-tear-down-order-v2.patch Type: text/x-patch Size: 1853 bytes Desc: not available URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20171013/7f14204b/attachment-0001.bin>