The free_job() is called in sched_job_finish() which is queued on a WORK and scheduled from that â??amd_sched_fence_finished()â?? So the finishing timing of free_job() is asynchronized with sched_process_job() There is chance that free_job() called before that â??trace_amd_sched_process_jobâ??, correct? And if so the s_fence referred by it maybe a wild pointer BR Monk From: Liu, Monk Sent: 2017å¹´10æ??13æ?¥ 16:49 To: Koenig, Christian <Christian.Koenig at amd.com>; Nicolai Hähnle <nhaehnle at gmail.com>; amd-gfx at lists.freedesktop.org Subject: RE: regression with d6c650c0a8f6f671e49553725e1db541376d95f2 No thatâ??s not true The free_job() is called in sched_job_finish() which is queued on a WORK and scheduled from that â??amd_sched_fence_finished()â?? So the finishing timing of free_job() is asynchronized with sched_process_job() How can you sure free_job() must before â??trace_amd_sched_process_jobâ?? ? From: Koenig, Christian Sent: 2017å¹´10æ??13æ?¥ 16:43 To: Liu, Monk <Monk.Liu at amd.com<mailto:Monk.Liu at amd.com>>; Nicolai Hähnle <nhaehnle at gmail.com<mailto:nhaehnle at gmail.com>>; amd-gfx at lists.freedesktop.org<mailto:amd-gfx at lists.freedesktop.org> Subject: Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2 The free_job() callback is called only way after the job has finished. That is one change actually made by you to the code :) Christian. Am 13.10.2017 um 10:39 schrieb Liu, Monk: I doubt it would always work fineâ?¦ First, we have FENCE_TRACE reference s_fence->finished after â??fence_signal(&fence->finished)â?? Second, we have trace_amd_sched_proess_job(s_fence) after â??amd_sched_fence_finished()â??, If you put the finished before free_job() and by coincidence the job_finish() get very soon executed youâ??ll have odds to hit wild pointer on above two cases BR Monk From: Koenig, Christian Sent: 2017å¹´10æ??13æ?¥ 16:17 To: Liu, Monk <Monk.Liu at amd.com><mailto:Monk.Liu at amd.com>; Nicolai Hähnle <nhaehnle at gmail.com><mailto:nhaehnle at gmail.com>; amd-gfx at lists.freedesktop.org<mailto:amd-gfx at lists.freedesktop.org> Subject: Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2 Yeah, that change is actually incorrect and should be reverted. What we really need to do is remove dropping sched_job->s_fence from amd_sched_process_job() into amd_sched_job_finish() directly before the call to free_job(). Regards, Christian. Am 13.10.2017 um 09:24 schrieb Liu, Monk: commit d6c650c0a8f6f671e49553725e1db541376d95f2 Author: Nicolai Hähnle <nicolai.haehnle at amd.com><mailto:nicolai.haehnle at amd.com> @@ -611,6 +611,10 @@ static int amd_sched_main(void *param) fence = sched->ops->run_job(sched_job); amd_sched_fence_scheduled(s_fence); + + /* amd_sched_process_job drops the job's reference of the fence. */ + sched_job->s_fence = NULL; + if (fence) { s_fence->parent = dma_fence_get(fence); r = dma_fence_add_callback(fence, &s_fence->cb, Hi Nicolai with this patch, you will break "amdgpu_sched_hw_job_reset()"routine: void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched) { struct amd_sched_job *s_job; spin_lock(&sched->job_list_lock); list_for_each_entry_reverse(s_job, &sched->ring_mirror_list, node) { if (s_job->s_fence->parent && fence_remove_callback(s_job->s_fence->parent, &s_job->s_fence->cb)) { fence_put(s_job->s_fence->parent); s_job->s_fence->parent = NULL; atomic_dec(&sched->hw_rq_count); } } spin_unlock(&sched->job_list_lock); } see that without sched_job->s_fence, you cannot remove the callback from its hw fence, any idea?? BR Monk -------------- next part -------------- An HTML attachment was scrubbed... URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20171013/abd83b32/attachment-0001.html>