Hello Luben,

On Wed, 21 Jun 2023 09:56:40 -0400
Luben Tuikov <luben.tuikov@xxxxxxx> wrote:

> On 2023-06-19 03:19, Boris Brezillon wrote:
> > drm_sched_entity_kill_jobs_cb() logic is omitting the last fence popped
> > from the dependency array that was waited upon before
> > drm_sched_entity_kill() was called (drm_sched_entity::dependency field),
> > so we're basically waiting for all dependencies except one.
> > 
> > In theory, this wait shouldn't be needed because resources should have
> > their users registered to the dma_resv object, thus guaranteeing that
> > future jobs wanting to access these resources wait on all the previous
> > users (depending on the access type, of course). But we want to keep
> > these explicit waits in the kill entity path just in case.
> > 
> > Let's make sure we keep all dependencies in the array in
> > drm_sched_job_dependency(), so we can iterate over the array and wait
> > in drm_sched_entity_kill_jobs_cb().
> > 
> > We also make sure we wait on drm_sched_fence::finished if we were
> > originally asked to wait on drm_sched_fence::scheduled. In that case,
> > we assume the intent was to delegate the wait to the firmware/GPU or
> > rely on the pipelining done at the entity/scheduler level, but when
> > killing jobs, we really want to wait for completion not just scheduling.
> > 
> > v6:
> > - Back to v4 implementation
> > - Add Christian's R-b
> > 
> > v5:
> > - Flag deps on which we should only wait for the scheduled event
> >   at insertion time
> > 
> > v4:
> > - Fix commit message
> > - Fix a use-after-free bug
> > 
> > v3:
> > - Always wait for drm_sched_fence::finished fences in
> >   drm_sched_entity_kill_jobs_cb() when we see a sched_fence
> > 
> > v2:
> > - Don't evict deps in drm_sched_job_dependency()
> 
> Hmm, why is this in reverse chronological order?
> It's very confusing.

Dunno, that's how I've always ordered things, and a quick look at some
dri-devel patches [1][2] makes me think I'm not the only one to start
from the latest submission.

[1]https://lkml.org/lkml/2023/6/19/941
[2]https://lore.kernel.org/dri-devel/cover.1686729444.git.Sandor.yu@xxxxxxx/T/#t

> > 
> > Signed-off-by: Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx>
> > Suggested-by: "Christian König" <christian.koenig@xxxxxxx>
> > Reviewed-by: "Christian König" <christian.koenig@xxxxxxx>
> 
> These three lines would usually come after the CCs.

Again, I think I've always inserted those tags before the Cc, but I can
re-order things if you prefer.

Let me know if you want me to send a v7 addressing the Cc+changelog
ordering.
Regards,

Boris

> 
> Regards,
> Luben
> 
> > Cc: Frank Binns <frank.binns@xxxxxxxxxx>
> > Cc: Sarah Walker <sarah.walker@xxxxxxxxxx>
> > Cc: Donald Robson <donald.robson@xxxxxxxxxx>
> > Cc: Luben Tuikov <luben.tuikov@xxxxxxx>
> > Cc: David Airlie <airlied@xxxxxxxxx>
> > Cc: Daniel Vetter <daniel@xxxxxxxx>
> > Cc: Sumit Semwal <sumit.semwal@xxxxxxxxxx>
> > Cc: "Christian König" <christian.koenig@xxxxxxx>
> > ---
> >  drivers/gpu/drm/scheduler/sched_entity.c | 41 +++++++++++++++++++-----
> >  1 file changed, 33 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > index 68e807ae136a..ec41d82d0141 100644
> > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > @@ -176,16 +176,32 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
> >  {
> >  	struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
> >  						 finish_cb);
> > -	int r;
> > +	unsigned long index;
> >  
> >  	dma_fence_put(f);
> >  
> >  	/* Wait for all dependencies to avoid data corruptions */
> > -	while (!xa_empty(&job->dependencies)) {
> > -		f = xa_erase(&job->dependencies, job->last_dependency++);
> > -		r = dma_fence_add_callback(f, &job->finish_cb,
> > -					   drm_sched_entity_kill_jobs_cb);
> > -		if (!r)
> > +	xa_for_each(&job->dependencies, index, f) {
> > +		struct drm_sched_fence *s_fence = to_drm_sched_fence(f);
> > +
> > +		if (s_fence && f == &s_fence->scheduled) {
> > +			/* The dependencies array had a reference on the scheduled
> > +			 * fence, and the finished fence refcount might have
> > +			 * dropped to zero. Use dma_fence_get_rcu() so we get
> > +			 * a NULL fence in that case.
> > +			 */
> > +			f = dma_fence_get_rcu(&s_fence->finished);
> > +
> > +			/* Now that we have a reference on the finished fence,
> > +			 * we can release the reference the dependencies array
> > +			 * had on the scheduled fence.
> > +			 */
> > +			dma_fence_put(&s_fence->scheduled);
> > +		}
> > +
> > +		xa_erase(&job->dependencies, index);
> > +		if (f && !dma_fence_add_callback(f, &job->finish_cb,
> > +						 drm_sched_entity_kill_jobs_cb))
> >  			return;
> >  
> >  		dma_fence_put(f);
> > @@ -415,8 +431,17 @@ static struct dma_fence *
> >  drm_sched_job_dependency(struct drm_sched_job *job,
> >  			 struct drm_sched_entity *entity)
> >  {
> > -	if (!xa_empty(&job->dependencies))
> > -		return xa_erase(&job->dependencies, job->last_dependency++);
> > +	struct dma_fence *f;
> > +
> > +	/* We keep the fence around, so we can iterate over all dependencies
> > +	 * in drm_sched_entity_kill_jobs_cb() to ensure all deps are signaled
> > +	 * before killing the job.
> > +	 */
> > +	f = xa_load(&job->dependencies, job->last_dependency);
> > +	if (f) {
> > +		job->last_dependency++;
> > +		return dma_fence_get(f);
> > +	}
> >  
> >  	if (job->sched->ops->prepare_job)
> >  		return job->sched->ops->prepare_job(job, entity);
> 
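
For readers tracing the "users registered to the dma_resv object" argument
in the commit message above: the implicit synchronization it relies on comes
from drivers attaching each job's finished fence to the reservation object
of every buffer the job touches. Below is a minimal, hypothetical sketch of
that registration; the helper name, the bare "resv" parameter, and the WRITE
usage are illustrative assumptions, not part of this patch:

  #include <linux/dma-resv.h>
  #include <drm/gpu_scheduler.h>

  /* Illustrative only: attach a job's finished fence to a buffer's
   * reservation object so that later users of the buffer wait on it
   * implicitly. This registration is what makes the explicit waits in
   * drm_sched_entity_kill_jobs_cb() a "just in case" safety net.
   */
  static int example_attach_job_fence(struct dma_resv *resv,
                                      struct drm_sched_job *job)
  {
          int ret;

          ret = dma_resv_lock(resv, NULL);
          if (ret)
                  return ret;

          /* Make room for one more fence slot before adding. */
          ret = dma_resv_reserve_fences(resv, 1);
          if (!ret)
                  dma_resv_add_fence(resv, &job->s_fence->finished,
                                     DMA_RESV_USAGE_WRITE);

          dma_resv_unlock(resv);
          return ret;
  }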