Re: [PATCH] drm/sched: revert "drm_sched_job_cleanup(): correct false doc"

Philipp Stanner <phasta@xxxxxxxxxxx> · Mon, 10 Mar 2025 09:41:36 +0100

On Mon, 2025-03-10 at 08:44 +0100, Christian König wrote:
> This reverts commit 44d2f310f008613c1dbe5e234c2cf2be90cbbfab.
> 
> Sorry for the delayed response, I only stumbled over this now while
> going
> over old mails and then re-thinking my reviewed by for this change.
> 
> The function drm_sched_job_arm() is indeed the point of no return.

So would you say that the comment in the function's body,
"drm_sched_job_arm() has been called" actually means / should mean "job
had been submitted with drm_sched_entity_push_job()"?

> The
> background is that it is nearly impossible for the driver to
> correctly
> retract the fence and signal it in the order enforced by the
> dma_fence
> framework.
> 
> The code in drm_sched_job_cleanup() is for the purpose to cleanup
> after
> the job was armed through drm_sched_job_arm() *and* processed by the
> scheduler.
> 
> The correct approach for error handling in this situation is to set
> the
> error on the fences and then push to the entity anyway.

You expect the driver to set an error on scheduled and finished fence?
I think this would be very likely to explode. AFAICS the scheduler code
has no awareness for anyone else having touched those fences.

If at all, if this is really a problem, we should tell the driver that
it must enforce that there will be no failure between
drm_sched_job_arm() and drm_sched_job_entity_push_job(). That's how
Nouveau does it.

Let's set our understanding straight before reverting. What
drm_sched_job_arm() does is:

   1. Pick scheduler, rq and priority for the job
   2. Atomically increment the job_id_count and assign to job
   3. Call drm_sched_fence_init() and therefore:
   4. Set the fence's scheduler
   5. Set the fences seq nr atomically
   6. Initialize scheduled and finished fence

What of the above precisely is the problem?

You say the driver cannot "correctly retract the fence and signal it in
the order enforced by the dma_fence framework". You mean that the
finished_fences have to be signalled in an order only the scheduler
knows? I assume you're referring to set dependencies?

P.

> We can certainly
> improve the documentation, but removing the warning is clearly not a
> good
> idea.
> 
> Signed-off-by: Christian König <christian.koenig@xxxxxxx>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 12 +++++-------
>  1 file changed, 5 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 53e6aec37b46..4d4219fbe49d 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1015,13 +1015,11 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
>   * Cleans up the resources allocated with drm_sched_job_init().
>   *
>   * Drivers should call this from their error unwind code if @job is
> aborted
> - * before it was submitted to an entity with
> drm_sched_entity_push_job().
> + * before drm_sched_job_arm() is called.
>   *
> - * Since calling drm_sched_job_arm() causes the job's fences to be
> initialized,
> - * it is up to the driver to ensure that fences that were exposed to
> external
> - * parties get signaled. drm_sched_job_cleanup() does not ensure
> this.
> - *
> - * This function must also be called in &struct
> drm_sched_backend_ops.free_job
> + * After that point of no return @job is committed to be executed by
> the
> + * scheduler, and this function should be called from the
> + * &drm_sched_backend_ops.free_job callback.
>   */
>  void drm_sched_job_cleanup(struct drm_sched_job *job)
>  {
> @@ -1032,7 +1030,7 @@ void drm_sched_job_cleanup(struct drm_sched_job
> *job)
>   /* drm_sched_job_arm() has been called */
>   dma_fence_put(&job->s_fence->finished);
>   } else {
> - /* aborted job before arming */
> + /* aborted job before committing to run it */
>   drm_sched_fence_free(job->s_fence);
>   }
>