Re: [PATCH 1/4] drm/scheduler: Add drm_sched_cancel_all_jobs helper

Philipp Stanner <phasta@xxxxxxxxxxx> · Thu, 06 Feb 2025 14:53:46 +0100

On Thu, 2025-02-06 at 14:46 +0100, Christian König wrote:
> On 06.02.25 at 14:35, Philipp Stanner wrote:
> > On Wed, 2025-02-05 at 15:33 +0000, Tvrtko Ursulin wrote:
> > > The helper copies code from the existing
> > > amdgpu_job_stop_all_jobs_on_sched with the purpose of reducing the
> > > amount of driver code which directly touches scheduler internals.
> > > 
> > > If or when amdgpu manages to change the approach for handling the
> > > permanently wedged state this helper can be removed.
> > Have you checked how many other drivers might need such a helper?
> > 
> > I have somewhat mixed feelings about this, because, AFAICT, in the
> > past helpers have been added for just 1 driver, such as
> > drm_sched_wqueue_ready(), and then they have stayed for almost a
> > decade.
> > 
> > AFAIU this is just a code move, and only really "decouples" amdgpu
> > in the sense of having an official scheduler function that does
> > what amdgpu used to do.
> > 
> > So my tendency here would be to continue "allowing" amdgpu to touch
> > the scheduler internals until amdgpu fixes this "permanently wedged
> > state". And if that's too difficult, couldn't the helper reside in
> > an amdgpu/sched_helpers.c or similar?
> > 
> > I think that's better than adding 1 helper for just 1 driver and
> > then supposedly removing it again in the future.
> 
> Yeah, agree to that general approach.
> 
> What amdgpu does here is kind of nasty and looks unnecessary, but
> changing it means we need time from Hawkings and his people involved
> on RAS for amdgpu.
> 
> When we move the code to the scheduler we make it an official
> scheduler interface for others to replicate, and that is exactly what
> we should try to avoid.

Yes, I think if we all agree that the scheduler must only contain
infrastructure useful for >= 2 DRM drivers' job queueing related tasks
without any hacks for driver internal issues, that would be a great
thing.

P.

> 
> So my suggestion is to add a
> /* TODO: This is nasty and should be avoided */
> to the amdgpu code instead.
> 
> Regards,
> Christian.
> 
> > 
> > P.
> > 
> > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx>
> > > Cc: Christian König <christian.koenig@xxxxxxx>
> > > Cc: Danilo Krummrich <dakr@xxxxxxxxxx>
> > > Cc: Matthew Brost <matthew.brost@xxxxxxxxx>
> > > Cc: Philipp Stanner <phasta@xxxxxxxxxx>
> > > ---
> > >   drivers/gpu/drm/scheduler/sched_main.c | 44 ++++++++++++++++++++++++++
> > >   include/drm/gpu_scheduler.h            |  1 +
> > >   2 files changed, 45 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > index a48be16ab84f..0363655db22d 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -703,6 +703,50 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
> > >   }
> > >   EXPORT_SYMBOL(drm_sched_start);
> > >   
> > > +/**
> > > + * drm_sched_cancel_all_jobs - Cancel all queued and scheduled jobs
> > > + *
> > > + * @sched: scheduler instance
> > > + * @errno: error value to set on signaled fences
> > > + *
> > > + * Signal all queued and scheduled jobs and set them to error state.
> > > + *
> > > + * Scheduler must be stopped before calling this.
> > > + */
> > > +void drm_sched_cancel_all_jobs(struct drm_gpu_scheduler *sched, int errno)
> > > +{
> > > +	struct drm_sched_entity *entity;
> > > +	struct drm_sched_fence *s_fence;
> > > +	struct drm_sched_job *job;
> > > +	enum drm_sched_priority p;
> > > +
> > > +	drm_WARN_ON_ONCE(sched, !sched->pause_submit);
> > > +
> > > +	/* Signal all jobs not yet scheduled */
> > > +	for (p = DRM_SCHED_PRIORITY_KERNEL; p < sched->num_rqs; p++) {
> > > +		struct drm_sched_rq *rq = sched->sched_rq[p];
> > > +
> > > +		spin_lock(&rq->lock);
> > > +		list_for_each_entry(entity, &rq->entities, list) {
> > > +			while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
> > > +				s_fence = job->s_fence;
> > > +				dma_fence_signal(&s_fence->scheduled);
> > > +				dma_fence_set_error(&s_fence->finished, errno);
> > > +				dma_fence_signal(&s_fence->finished);
> > > +			}
> > > +		}
> > > +		spin_unlock(&rq->lock);
> > > +	}
> > > +
> > > +	/* Signal all jobs already scheduled to HW */
> > > +	list_for_each_entry(job, &sched->pending_list, list) {
> > > +		s_fence = job->s_fence;
> > > +		dma_fence_set_error(&s_fence->finished, errno);
> > > +		dma_fence_signal(&s_fence->finished);
> > > +	}
> > > +}
> > > +EXPORT_SYMBOL(drm_sched_cancel_all_jobs);
> > > +
> > >   /**
> > >    * drm_sched_resubmit_jobs - Deprecated, don't use in new code!
> > >    *
> > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > index a0ff08123f07..298513f8c327 100644
> > > --- a/include/drm/gpu_scheduler.h
> > > +++ b/include/drm/gpu_scheduler.h
> > > @@ -579,6 +579,7 @@ void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
> > >   void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
> > >   void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
> > >   void drm_sched_start(struct drm_gpu_scheduler *sched, int errno);
> > > +void drm_sched_cancel_all_jobs(struct drm_gpu_scheduler *sched, int errno);
> > >   void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
> > >   void drm_sched_increase_karma(struct drm_sched_job *bad);
> > >   void drm_sched_reset_karma(struct drm_sched_job *bad);
>
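
For readers following the thread, the two phases of the quoted helper
(jobs still queued in an entity, then jobs already on the pending list)
can be modeled in plain userspace C. This is only a sketch under stated
assumptions, not kernel code: every model_* name below is hypothetical,
fences are reduced to a flag plus an error value, the run-queue and
entity levels are collapsed into one singly linked list, and there is no
locking. Unlike the patch, which only warns when the scheduler is not
paused, the model refuses to do anything in that case.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dma_fence: a signaled flag and an error. */
struct model_fence {
	bool signaled;
	int error;
};

/* Hypothetical stand-in for a scheduler job with its two fences. */
struct model_job {
	struct model_fence scheduled;
	struct model_fence finished;
	struct model_job *next;
};

/* Hypothetical stand-in for the scheduler state the helper touches. */
struct model_sched {
	bool pause_submit;         /* true once the scheduler is stopped */
	struct model_job *queued;  /* jobs not yet handed to the hardware */
	struct model_job *pending; /* jobs already handed to the hardware */
};

/* Record an error on a fence that has not been signaled yet. */
static void model_fence_set_error(struct model_fence *f, int error)
{
	if (!f->signaled)
		f->error = error;
}

static void model_fence_signal(struct model_fence *f)
{
	f->signaled = true;
}

/*
 * Cancel every job known to the model scheduler.
 * Returns the number of jobs cancelled, or -1 if the scheduler is not
 * paused (the quoted patch only emits a warning in that case).
 */
static int model_cancel_all_jobs(struct model_sched *sched, int error)
{
	struct model_job *job;
	int cancelled = 0;

	assert(sched != NULL);
	if (!sched->pause_submit)
		return -1;

	/* Phase 1: pop queued jobs and signal both of their fences. */
	while ((job = sched->queued)) {
		sched->queued = job->next;
		job->next = NULL;
		model_fence_signal(&job->scheduled);
		model_fence_set_error(&job->finished, error);
		model_fence_signal(&job->finished);
		cancelled++;
	}

	/* Phase 2: pending jobs stay on their list; only "finished" is signaled. */
	for (job = sched->pending; job; job = job->next) {
		model_fence_set_error(&job->finished, error);
		model_fence_signal(&job->finished);
		cancelled++;
	}

	return cancelled;
}
```

A caller would set pause_submit, then call
model_cancel_all_jobs(&sched, -ECANCELED). Queued jobs are removed from
the queue, while pending jobs remain listed, mirroring the way the patch
pops entity queues but merely walks sched->pending_list.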