On Fri, 2024-12-20 at 14:25 +0100, Christian König wrote: > Am 20.12.24 um 14:18 schrieb Danilo Krummrich: > > On Fri, Dec 20, 2024 at 01:53:34PM +0100, Christian König wrote: > > > Am 20.12.24 um 13:45 schrieb Philipp Stanner: > > > > From: Philipp Stanner <pstanner@xxxxxxxxxx> > > > > > > > > drm_sched_backend_ops.run_job() returns a dma_fence for the > > > > scheduler. > > > > That fence is signalled by the driver once the hardware > > > > completed the > > > > associated job. The scheduler does not increment the reference > > > > count on > > > > that fence, but implicitly expects to inherit this fence from > > > > run_job(). > > > > > > > > This is relatively subtle and prone to misunderstandings. > > > > > > > > This implies that, to keep a reference for itself, a driver > > > > needs to > > > > call dma_fence_get() in addition to dma_fence_init() in that > > > > callback. > > > > > > > > It's further complicated by the fact that the scheduler even > > > > decrements > > > > the refcount in drm_sched_run_job_work() since it created a new > > > > reference in drm_sched_fence_scheduled(). It does, however, > > > > still use > > > > its pointer to the fence after calling dma_fence_put() - which > > > > is safe > > > > because of the aforementioned new reference, but actually still > > > > violates > > > > the refcounting rules. > > > > > > > > Improve the explanatory comment for that decrement. > > > > > > > > Move the call to dma_fence_put() to the position behind the > > > > last usage > > > > of the fence. > > > > > > > > Document the necessity to increment the reference count in > > > > drm_sched_backend_ops.run_job(). > > > > > > > > Cc: Christian König <christian.koenig@xxxxxxx> > > > > Cc: Tvrtko Ursulin <tursulin@xxxxxxxxxxx> > > > > Cc: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> > > > > Signed-off-by: Philipp Stanner <pstanner@xxxxxxxxxx> > > > > --- > > > > drivers/gpu/drm/scheduler/sched_main.c | 10 +++++++--- > > > > include/drm/gpu_scheduler.h | 20 > > > > ++++++++++++++++---- > > > > 2 files changed, 23 insertions(+), 7 deletions(-) > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c > > > > b/drivers/gpu/drm/scheduler/sched_main.c > > > > index 7ce25281c74c..d6f8df39d848 100644 > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > > > + * > > > > + * @sched_job: the job to run > > > > + * > > > > + * Returns: dma_fence the driver must signal once the > > > > hardware has > > > > + * completed the job ("hardware fence"). > > > > + * > > > > + * Note that the scheduler expects to 'inherit' its > > > > own reference to > > > > + * this fence from the callback. It does not invoke an > > > > extra > > > > + * dma_fence_get() on it. Consequently, this callback > > > > must return a > > > > + * fence whose refcount is at least 2: One for the > > > > scheduler's > > > > + * reference returned here, another one for the > > > > reference kept by the > > > > + * driver. > > > Well the driver actually doesn't need any extra reference. The > > > scheduler > > > just needs to guarantee that this reference isn't dropped before > > > it is > > > signaled. > > I think he means the reference the driver's fence context has to > > have in order > > to signal that thing eventually. > > Yeah, but this is usually a weak reference. IIRC most drivers don't > increment the reference count for the reference they keep to signal a > fence. > > It's expected that the consumers of the dma_fence keep the fence > alive > at least until it is signaled. So are you saying that the driver having an extra reference (without having obtained it with dma_fence_get()) is not an issue because the driver is the one who will signal the fence [and then be done with it]? > That's why we have this nice warning in > dma_fence_release(). > > On the other hand I completely agree it would be more defensive if > drivers increment the reference count for the reference they keep for > signaling. > > So if we want to document that the fence reference count should at > least > be 2 we somehow need to enforce this with a warning for example. We could – but I'm not sure whether it really needs to be "enforced", especially if it were only to be a minor issue, as you seem to hint at above. Document it is the minimum IMO P. > > Regards, > Christian. > > > > > > > > Regards, > > > Christian. > > > > > > > */ > > > > struct dma_fence *(*run_job)(struct drm_sched_job > > > > *sched_job); >