On Thu, 8 Jun 2023 08:55:51 +0200 Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> wrote: > If I understand correctly, drm_sched_entity_kill_jobs_cb() is supposed > to wait on all the external dependencies (those added to > drm_sched_job::dependencies) before signaling the job finished fence. > This is done this way to prevent jobs depending on these canceled jobs > from considering the resources they want to access as ready, when > they're actually still used by other components, thus leading to > potential data corruptions. > > The problem is, the kill_jobs logic is omitting the last fence popped > from the dependencies array that was waited upon before > drm_sched_entity_kill() was called (drm_sched_entity::dependency field), > so we're basically waiting for all dependencies except one. > > This is an attempt at fixing that, but also an opportunity to make sure > I understood the drm_sched_entity_kill(), because I'm not 100% sure if ^ the drm_sched_entity_kill() logic correctly, ... > skipping this currently popped dependency was intentional or not. I can't > see a good reason why we'd want to skip that one, but I might be missing > something. > > Signed-off-by: Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> > Cc: Frank Binns <frank.binns@xxxxxxxxxx> > Cc: Sarah Walker <sarah.walker@xxxxxxxxxx> > Cc: Donald Robson <donald.robson@xxxxxxxxxx> > Cc: Luben Tuikov <luben.tuikov@xxxxxxx> > Cc: David Airlie <airlied@xxxxxxxxx> > Cc: Daniel Vetter <daniel@xxxxxxxx> > Cc: Sumit Semwal <sumit.semwal@xxxxxxxxxx> > Cc: "Christian König" <christian.koenig@xxxxxxx> > --- > Stumbled across this issue while working on native dependency support > with Donald (which will be posted soon). Flagged as RFC, because I'm > not sure this is legit, and also not sure we want to fix it this way. > I tried re-using drm_sched_entity::dependency, but it's a bit of a mess > because of the asynchronousity of the wait, and the fact we use > drm_sched_entity::dependency to know if we have a clear_dep() > callback registered, so we can easily reset it without removing the ^ we can't ... > callback.