Re: [PATCH] drm/sched: drm_sched_job_cleanup(): correct false doc

Philipp Stanner <phasta@xxxxxxxxxxx> · Tue, 04 Mar 2025 08:27:42 +0100

On Mon, 2025-03-03 at 17:13 +0000, Tvrtko Ursulin wrote:
> 
> On 03/03/2025 14:32, Philipp Stanner wrote:
> > drm_sched_job_cleanup()'s documentation claims that calling
> > drm_sched_job_arm() is a "point of no return", implying that
> > afterwards
> > a job cannot be cancelled anymore.
> > 
> > This is not correct, as proven by the function's code itself, which
> > takes a previous call to drm_sched_job_arm() into account. In
> > truth, the
> > decisive factors are whether fences have been shared (e.g., with
> > other
> > processes) and if the job has been submitted to an entity already.
> > 
> > Correct the wrong docstring.
> > 
> > Signed-off-by: Philipp Stanner <phasta@xxxxxxxxxx>
> > ---
> >   drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++----
> >   1 file changed, 6 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > b/drivers/gpu/drm/scheduler/sched_main.c
> > index c634993f1346..db9e08e6e455 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1013,11 +1013,13 @@
> > EXPORT_SYMBOL(drm_sched_job_has_dependency);
> >    * Cleans up the resources allocated with drm_sched_job_init().
> >    *
> >    * Drivers should call this from their error unwind code if @job
> > is aborted
> > - * before drm_sched_job_arm() is called.
> > + * before it was submitted to an entity with
> > drm_sched_entity_push_job().
> >    *
> > - * After that point of no return @job is committed to be executed
> > by the
> > - * scheduler, and this function should be called from the
> > - * &drm_sched_backend_ops.free_job callback.
> > + * Since calling drm_sched_job_arm() causes the job's fences to be
> > initialized,
> > + * it is up to the driver to ensure that fences that were exposed
> > to external
> > + * parties get signaled. drm_sched_job_cleanup() does not ensure
> > this.
> > + *
> > + * This function must also be called in &struct
> > drm_sched_backend_ops.free_job
> >    */
> >   void drm_sched_job_cleanup(struct drm_sched_job *job)
> >   {
> 
> I also do not see anything so special in the flows which would
> disallow 
> "aborting" jobs after arming. And it definitely can happen in the 
> practice. It was probably just unclear kerneldoc.

I thought about why they wrote that for a few months and the only
reason I can imagine was that they were worried about the initialized
fences being shared.

> 
> What I would suggest to add to the patch is to change the comment in 
> drm_sched_job_cleanup() implementation:
> 
> This one:
> 
> 		/* aborted job before committing to run it */
> 
> Probably just change to:
> 
> 		/* aborted before arming */

Yes, good catch. Will do.

P.

> 
> With that you can add my r-b.
> 
> Regards,
> 
> Tvrtko
>