Because you didn't try GPU reset feature -----Original Message----- From: amd-gfx [mailto:amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx] On Behalf Of Tom St Denis Sent: 2017å¹´10æ??13æ?¥ 19:11 To: amd-gfx at lists.freedesktop.org Subject: Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2 At the very least I don't get the hangs like I used to before the patchset so this patch isn't regressing any of that behaviour. Tom On 13/10/17 07:06 AM, Liu, Monk wrote: > Yeah, this one looks good > > You can put my reviewed-by on it > > *From:*Koenig, Christian > *Sent:* 2017å¹´10æ??13æ?¥18:14 > *To:* Liu, Monk <Monk.Liu at amd.com>; Nicolai Hähnle > <nhaehnle at gmail.com>; amd-gfx at lists.freedesktop.org > *Subject:* Re: regression with > d6c650c0a8f6f671e49553725e1db541376d95f2 > > Good point as well. How about the attached version? > > This time we keep an extra reference in amd_sched_process_job() until > we are sure that we don't need the s_fence any more. > > Regards, > Christian. > > Am 13.10.2017 um 11:13 schrieb Liu, Monk: > > your patch looks good, do you think we should also do this: > >  void amd_sched_fence_scheduled(struct amd_sched_fence *fence) >  { > -      int ret = fence_signal(&fence->scheduled); > +      int ret; > + > +      fence_get(&fence->scheduled;) > +      ret = fence_signal(&fence->scheduled); > >        if (!ret) >                FENCE_TRACE(&fence->scheduled, "signaled from irq > context\n"); >        else >                FENCE_TRACE(&fence->scheduled, "was already > signaled\n"); > +      fence_put(&fence->scheduled); >  } > > > ---------------------------------------------------------------------- > -- > > *From:*Koenig, Christian > *Sent:* Friday, October 13, 2017 5:00:27 PM > *To:* Liu, Monk; Nicolai Hähnle; amd-gfx at lists.freedesktop.org > <mailto:amd-gfx at lists.freedesktop.org> > *Subject:* Re: regression with > d6c650c0a8f6f671e49553725e1db541376d95f2 > > There is chance that free_job() called before that > â??trace_amd_sched_process_jobâ??, correct? > > Correct, but that is harmless. > > Take a look what trace_amd_sched_process_job actually does, it just > prints the pointer of the fence structure (but the pointer might be > stale at this point). > > Nevertheless you are right that this is really ugly. > > How about the attached patch to fix the issue? > > Regards, > Christian. > > Am 13.10.2017 um 10:51 schrieb Liu, Monk: > > The free_job() is called in sched_job_finish() which is queued > on a WORK and scheduled from that â??amd_sched_fence_finished()â?? > > So the finishing timing of free_job() is asynchronized with > sched_process_job() > > There is chance that free_job() called before that > â??trace_amd_sched_process_jobâ??, correct? > > And if so the s_fence referred by it maybe a wild pointer > > BR Monk > > *From:*Liu, Monk > *Sent:* 2017å¹´10æ??13æ?¥16:49 > *To:* Koenig, Christian <Christian.Koenig at amd.com> > <mailto:Christian.Koenig at amd.com>; Nicolai Hähnle > <nhaehnle at gmail.com> <mailto:nhaehnle at gmail.com>; > amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org> > *Subject:* RE: regression with > d6c650c0a8f6f671e49553725e1db541376d95f2 > > No thatâ??s not true > > The free_job() is called in sched_job_finish() which is queued > on a WORK and scheduled from that â??amd_sched_fence_finished()â?? > > So the finishing timing of free_job() is asynchronized with > sched_process_job() > > How can you sure free_job() must before > â??trace_amd_sched_process_jobâ?? ? > > *From:*Koenig, Christian > *Sent:* 2017å¹´10æ??13æ?¥16:43 > *To:* Liu, Monk <Monk.Liu at amd.com <mailto:Monk.Liu at amd.com>>; > Nicolai Hähnle <nhaehnle at gmail.com <mailto:nhaehnle at gmail.com>>; > amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org> > *Subject:* Re: regression with > d6c650c0a8f6f671e49553725e1db541376d95f2 > > The free_job() callback is called only way after the job has > finished. > > That is one change actually made by you to the code :) > > Christian. > > Am 13.10.2017 um 10:39 schrieb Liu, Monk: > > I doubt it would always work fineâ?¦ > > First, we have FENCE_TRACE reference s_fence->finished after > â??fence_signal(&fence->finished)â?? > > Second, we have trace_amd_sched_proess_job(s_fence) after > â??amd_sched_fence_finished()â??, > > If you put the finished before free_job() and by coincidence > the job_finish() get very soon executed youâ??ll have odds to > hit wild pointer on above two cases > > BR Monk > > *From:*Koenig, Christian > *Sent:* 2017å¹´10æ??13æ?¥16:17 > *To:* Liu, Monk <Monk.Liu at amd.com> > <mailto:Monk.Liu at amd.com>; Nicolai Hähnle > <nhaehnle at gmail.com> <mailto:nhaehnle at gmail.com>; > amd-gfx at lists.freedesktop.org > <mailto:amd-gfx at lists.freedesktop.org> > *Subject:* Re: regression with > d6c650c0a8f6f671e49553725e1db541376d95f2 > > Yeah, that change is actually incorrect and should be reverted. > > What we really need to do is remove dropping > sched_job->s_fence from amd_sched_process_job() into > amd_sched_job_finish() directly before the call to free_job(). > > Regards, > Christian. > > Am 13.10.2017 um 09:24 schrieb Liu, Monk: > > commit d6c650c0a8f6f671e49553725e1db541376d95f2 > Author: Nicolai Hähnle <nicolai.haehnle at amd.com> > <mailto:nicolai.haehnle at amd.com> > @@ -611,6 +611,10 @@ static int amd_sched_main(void > *param) > >                fence = sched->ops->run_job(sched_job); >                amd_sched_fence_scheduled(s_fence); > + > +              /* amd_sched_process_job drops the job's > reference of the fence. */ > +              sched_job->s_fence = NULL; > + >                if (fence) { >                        s_fence->parent = > dma_fence_get(fence); >                        r = > dma_fence_add_callback(fence, &s_fence->cb, > > Hi Nicolai > > with this patch, you will break > "amdgpu_sched_hw_job_reset()"routine: > > void > > amd_sched_hw_job_reset(struct amd_gpu_scheduler > > *sched) > > { > >     struct amd_sched_job > > *s_job; > >     spin_lock(&sched->job_list_lock); > >     list_for_each_entry_reverse(s_job, > > &sched->ring_mirror_list, node) { > >         if (s_job->s_fence->parent > > && > >         > fence_remove_callback(s_job->s_fence->parent, > >                     &s_job->s_fence->cb)) > > { > >             fence_put(s_job->s_fence->parent); > >             s_job->s_fence->parent > > = > > NULL; > >             atomic_dec(&sched->hw_rq_count); > >         } > >     } > >     spin_unlock(&sched->job_list_lock); > > } > > see that without sched_job->s_fence, you cannot remove > the callback from its hw fence, > > any idea?? > > BR Monk > > > > _______________________________________________ > amd-gfx mailing list > amd-gfx at lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx > _______________________________________________ amd-gfx mailing list amd-gfx at lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx