On Tue, 13 Jun 2023 11:28:45 +0200
Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> wrote:

> drm_sched_entity_kill_jobs_cb() logic is omitting the last fence popped
> from the dependency array that was waited upon before
> drm_sched_entity_kill() was called (drm_sched_entity::dependency field),
> so we're basically waiting for all dependencies except one.
>
> In theory, this wait shouldn't be needed because resources should have
> their users registered to the dma_resv object, thus guaranteeing that
> future jobs wanting to access these resources wait on all the previous
> users (depending on the access type, of course). But we want to keep
> these explicit waits in the kill entity path just in case.
>
> Let's make sure we keep all dependencies in the array in
> drm_sched_job_dependency(), so we can iterate over the array and wait
> in drm_sched_entity_kill_jobs_cb().
>
> We also make sure we wait on drm_sched_fence::finished if we were asked
> to wait on drm_sched_fence::scheduled; the intent there was probably to
> delegate the wait to the GPU, but we want resources to be completely
> idle when killing jobs.
>
> v3:
> - Always wait for drm_sched_fence::finished fences in
>   drm_sched_entity_kill_jobs_cb() when we see a sched_fence
>
> v2:
> - Don't evict deps in drm_sched_job_dependency()
>
> Signed-off-by: Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx>
> Suggested-by: "Christian König" <christian.koenig@xxxxxxx>
> Cc: Frank Binns <frank.binns@xxxxxxxxxx>
> Cc: Sarah Walker <sarah.walker@xxxxxxxxxx>
> Cc: Donald Robson <donald.robson@xxxxxxxxxx>
> Cc: Luben Tuikov <luben.tuikov@xxxxxxx>
> Cc: David Airlie <airlied@xxxxxxxxx>
> Cc: Daniel Vetter <daniel@xxxxxxxx>
> Cc: Sumit Semwal <sumit.semwal@xxxxxxxxxx>
> Cc: "Christian König" <christian.koenig@xxxxxxx>
> ---
>  drivers/gpu/drm/scheduler/sched_entity.c | 28 ++++++++++++++++++++----
>  1 file changed, 24 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 68e807ae136a..bc1bc3d47f7f 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -176,13 +176,24 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
>                                           struct dma_fence_cb *cb)
>  {
>         struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
>                                                  finish_cb);
> +       unsigned long index;
>         int r;
>
>         dma_fence_put(f);
>
>         /* Wait for all dependencies to avoid data corruptions */
> -       while (!xa_empty(&job->dependencies)) {
> -               f = xa_erase(&job->dependencies, job->last_dependency++);
> +       xa_for_each(&job->dependencies, index, f) {
> +               struct drm_sched_fence *s_fence = to_drm_sched_fence(f);
> +
> +               /* Make sure we wait for the finished fence here, so we can
> +                * guarantee that any job we depend on that is still accessing
> +                * resources is done before we signal this job finished fence
> +                * and unblock further accesses on those resources.
> +                */
> +               if (s_fence && f == &s_fence->scheduled)
> +                       f = &s_fence->finished;

Oops. There's a use-after-free + leak bug here: when switching from the
scheduled fence to the finished fence, we need to grab a reference on the
finished fence and release the one we held on the scheduled fence.
Sending a v4 to fix that.

> +
> +               xa_erase(&job->dependencies, index);
>                 r = dma_fence_add_callback(f, &job->finish_cb,
>                                            drm_sched_entity_kill_jobs_cb);
>                 if (!r)
> @@ -415,8 +426,17 @@ static struct dma_fence *
>  drm_sched_job_dependency(struct drm_sched_job *job,
>                          struct drm_sched_entity *entity)
>  {
> -       if (!xa_empty(&job->dependencies))
> -               return xa_erase(&job->dependencies, job->last_dependency++);
> +       struct dma_fence *f;
> +
> +       /* We keep the fence around, so we can iterate over all dependencies
> +        * in drm_sched_entity_kill_jobs_cb() to ensure all deps are signaled
> +        * before killing the job.
> +        */
> +       f = xa_load(&job->dependencies, job->last_dependency);
> +       if (f) {
> +               job->last_dependency++;
> +               return dma_fence_get(f);
> +       }
>
>         if (job->sched->ops->prepare_job)
>                 return job->sched->ops->prepare_job(job, entity);