On 2023-06-19 03:19, Boris Brezillon wrote:
> drm_sched_entity_kill_jobs_cb() logic is omitting the last fence popped
> from the dependency array that was waited upon before
> drm_sched_entity_kill() was called (drm_sched_entity::dependency field),
> so we're basically waiting for all dependencies except one.
>
> In theory, this wait shouldn't be needed because resources should have
> their users registered to the dma_resv object, thus guaranteeing that
> future jobs wanting to access these resources wait on all the previous
> users (depending on the access type, of course). But we want to keep
> these explicit waits in the kill entity path just in case.
>
> Let's make sure we keep all dependencies in the array in
> drm_sched_job_dependency(), so we can iterate over the array and wait
> in drm_sched_entity_kill_jobs_cb().
>
> We also make sure we wait on drm_sched_fence::finished if we were
> originally asked to wait on drm_sched_fence::scheduled. In that case,
> we assume the intent was to delegate the wait to the firmware/GPU or
> rely on the pipelining done at the entity/scheduler level, but when
> killing jobs, we really want to wait for completion not just scheduling.
>
> v6:
> - Back to v4 implementation
> - Add Christian's R-b
>
> v5:
> - Flag deps on which we should only wait for the scheduled event
>   at insertion time
>
> v4:
> - Fix commit message
> - Fix a use-after-free bug
>
> v3:
> - Always wait for drm_sched_fence::finished fences in
>   drm_sched_entity_kill_jobs_cb() when we see a sched_fence
>
> v2:
> - Don't evict deps in drm_sched_job_dependency()

Hmm, why is this in reverse chronological order? It's very confusing.

>
> Signed-off-by: Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx>
> Suggested-by: "Christian König" <christian.koenig@xxxxxxx>
> Reviewed-by: "Christian König" <christian.koenig@xxxxxxx>

These three lines would usually come after the CCs.
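To illustrate the scheduled-to-finished upgrade described above, here is a minimal userspace sketch. The types and names (`model_fence`, `kill_wait_target`, `FENCE_SCHEDULED`) are hypothetical stand-ins for the kernel's dma_fence/drm_sched_fence machinery, not the actual API; the point is only the selection logic: when a dependency is a scheduler's "scheduled" fence, the kill path should wait on the corresponding "finished" fence instead.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for dma_fence / drm_sched_fence. */
enum fence_kind { FENCE_PLAIN, FENCE_SCHEDULED };

struct model_fence {
	enum fence_kind kind;
	struct model_fence *finished; /* only set when kind == FENCE_SCHEDULED */
};

/* Pick the fence the kill path should actually wait on: a "scheduled"
 * scheduler fence is upgraded to its "finished" counterpart, so we wait
 * for completion rather than mere scheduling. */
static struct model_fence *kill_wait_target(struct model_fence *dep)
{
	if (dep->kind == FENCE_SCHEDULED)
		return dep->finished;
	return dep;
}
```

In the real patch this upgrade additionally has to handle refcounting (`dma_fence_get_rcu()` may return NULL when the finished fence's refcount already dropped to zero), which the model deliberately omits.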
Regards,
Luben

> Cc: Frank Binns <frank.binns@xxxxxxxxxx>
> Cc: Sarah Walker <sarah.walker@xxxxxxxxxx>
> Cc: Donald Robson <donald.robson@xxxxxxxxxx>
> Cc: Luben Tuikov <luben.tuikov@xxxxxxx>
> Cc: David Airlie <airlied@xxxxxxxxx>
> Cc: Daniel Vetter <daniel@xxxxxxxx>
> Cc: Sumit Semwal <sumit.semwal@xxxxxxxxxx>
> Cc: "Christian König" <christian.koenig@xxxxxxx>
> ---
>  drivers/gpu/drm/scheduler/sched_entity.c | 41 +++++++++++++++++++-----
>  1 file changed, 33 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 68e807ae136a..ec41d82d0141 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -176,16 +176,32 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
>  {
>  	struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
>  						 finish_cb);
> -	int r;
> +	unsigned long index;
>
>  	dma_fence_put(f);
>
>  	/* Wait for all dependencies to avoid data corruptions */
> -	while (!xa_empty(&job->dependencies)) {
> -		f = xa_erase(&job->dependencies, job->last_dependency++);
> -		r = dma_fence_add_callback(f, &job->finish_cb,
> -					   drm_sched_entity_kill_jobs_cb);
> -		if (!r)
> +	xa_for_each(&job->dependencies, index, f) {
> +		struct drm_sched_fence *s_fence = to_drm_sched_fence(f);
> +
> +		if (s_fence && f == &s_fence->scheduled) {
> +			/* The dependencies array had a reference on the scheduled
> +			 * fence, and the finished fence refcount might have
> +			 * dropped to zero. Use dma_fence_get_rcu() so we get
> +			 * a NULL fence in that case.
> +			 */
> +			f = dma_fence_get_rcu(&s_fence->finished);
> +
> +			/* Now that we have a reference on the finished fence,
> +			 * we can release the reference the dependencies array
> +			 * had on the scheduled fence.
> +			 */
> +			dma_fence_put(&s_fence->scheduled);
> +		}
> +
> +		xa_erase(&job->dependencies, index);
> +		if (f && !dma_fence_add_callback(f, &job->finish_cb,
> +						 drm_sched_entity_kill_jobs_cb))
>  			return;
>
>  		dma_fence_put(f);
> @@ -415,8 +431,17 @@ static struct dma_fence *
>  drm_sched_job_dependency(struct drm_sched_job *job,
>  			 struct drm_sched_entity *entity)
>  {
> -	if (!xa_empty(&job->dependencies))
> -		return xa_erase(&job->dependencies, job->last_dependency++);
> +	struct dma_fence *f;
> +
> +	/* We keep the fence around, so we can iterate over all dependencies
> +	 * in drm_sched_entity_kill_jobs_cb() to ensure all deps are signaled
> +	 * before killing the job.
> +	 */
> +	f = xa_load(&job->dependencies, job->last_dependency);
> +	if (f) {
> +		job->last_dependency++;
> +		return dma_fence_get(f);
> +	}
>
>  	if (job->sched->ops->prepare_job)
>  		return job->sched->ops->prepare_job(job, entity);
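The second hunk's behavior change (peek with `xa_load()` + `dma_fence_get()` instead of removing with `xa_erase()`, so the kill path can still walk the full array) can be sketched with a plain array. `model_job` and `model_job_dependency` are hypothetical names modeling the lookup, not the kernel structures:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPS 8

/* Illustrative stand-in for drm_sched_job: the deps array models the
 * dependencies xarray, last_dependency the cursor into it. */
struct model_job {
	void *deps[MAX_DEPS];
	unsigned long last_dependency;
};

/* Return the next dependency WITHOUT removing it from the array,
 * mirroring the xa_load()-based lookup in the patch: the cursor
 * advances, but the entry stays in place for a later full walk. */
static void *model_job_dependency(struct model_job *job)
{
	void *f = job->last_dependency < MAX_DEPS ?
		  job->deps[job->last_dependency] : NULL;

	if (f)
		job->last_dependency++;
	return f;
}
```

The key property is that after the scheduler has consumed every dependency, the array is still fully populated, which is what lets drm_sched_entity_kill_jobs_cb() iterate over all of them in the first hunk.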