Hello Christian, On Fri, 9 Jun 2023 13:53:59 +0200 Christian König <christian.koenig@xxxxxxx> wrote: > Am 08.06.23 um 08:55 schrieb Boris Brezillon: > > If I understand correctly, drm_sched_entity_kill_jobs_cb() is supposed > > to wait on all the external dependencies (those added to > > drm_sched_job::dependencies) before signaling the job finished fence. > > This is done this way to prevent jobs depending on these canceled jobs > > from considering the resources they want to access as ready, when > > they're actually still used by other components, thus leading to > > potential data corruptions. > > > > The problem is, the kill_jobs logic is omitting the last fence popped > > from the dependencies array that was waited upon before > > drm_sched_entity_kill() was called (drm_sched_entity::dependency field), > > so we're basically waiting for all dependencies except one. > > > > This is an attempt at fixing that, but also an opportunity to make sure > > I understood the drm_sched_entity_kill(), because I'm not 100% sure if > > skipping this currently popped dependency was intentional or not. I can't > > see a good reason why we'd want to skip that one, but I might be missing > > something. > > > > Signed-off-by: Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> > > Cc: Frank Binns <frank.binns@xxxxxxxxxx> > > Cc: Sarah Walker <sarah.walker@xxxxxxxxxx> > > Cc: Donald Robson <donald.robson@xxxxxxxxxx> > > Cc: Luben Tuikov <luben.tuikov@xxxxxxx> > > Cc: David Airlie <airlied@xxxxxxxxx> > > Cc: Daniel Vetter <daniel@xxxxxxxx> > > Cc: Sumit Semwal <sumit.semwal@xxxxxxxxxx> > > Cc: "Christian König" <christian.koenig@xxxxxxx> > > --- > > Stumbled across this issue while working on native dependency support > > with Donald (which will be posted soon). Flagged as RFC, because I'm > > not sure this is legit, and also not sure we want to fix it this way. > > I tried re-using drm_sched_entity::dependency, but it's a bit of a mess > > because of the asynchronousity of the wait, and the fact we use > > drm_sched_entity::dependency to know if we have a clear_dep() > > callback registered, so we can easily reset it without removing the > > callback. > > Well yes, that's a known problem. But this is really not the right > approach to fixing this. > > Trying to wait for all the dependencies before killing jobs was added > because of the way we kept track of dma_fences in dma_resv objects. > Basically adding exclusive fences removed all other fences leading to a > bit fragile memory management. Okay. > > This handling was removed by now and so the workaround for waiting for > dependencies is not really necessary any more, but I think it is still > better to do so. > > The right approach of getting this waiting for dependencies completely > straight is also not to touch entity->dependency in any way, but to stop > removing them from the XA in drm_sched_job_dependency(). Otherwise you > don't catch the pipeline optimized ones either. Do you want me to post a v2 doing that, or should I forget about it? If we decide to keep things like that, it might be good to at least add a comment explaining why we don't care. Regards, Boris