On Thu, 2025-02-06 at 13:53 +0000, Tvrtko Ursulin wrote:
> 
> On 06/02/2025 13:46, Christian König wrote:
> > Am 06.02.25 um 14:35 schrieb Philipp Stanner:
> > > On Wed, 2025-02-05 at 15:33 +0000, Tvrtko Ursulin wrote:
> > > > The helper copies code from the existing
> > > > amdgpu_job_stop_all_jobs_on_sched with the purpose of reducing
> > > > the amount of driver code which directly touches scheduler
> > > > internals.
> > > > 
> > > > If or when amdgpu manages to change the approach for handling
> > > > the permanently wedged state this helper can be removed.
> > > 
> > > Have you checked how many other drivers might need such a helper?
> > > 
> > > I have somewhat mixed feelings about this, because, AFAICT, in
> > > the past helpers have been added for just 1 driver, such as
> > > drm_sched_wqueue_ready(), and then they have stayed for almost a
> > > decade.
> > > 
> > > AFAIU this is just a code move, and it only really "decouples"
> > > amdgpu in the sense of having an official scheduler function that
> > > does what amdgpu used to do.
> > > 
> > > So my tendency here would be to continue "allowing" amdgpu to
> > > touch the scheduler internals until amdgpu fixes this
> > > "permanently wedged state". And if that's too difficult, couldn't
> > > the helper reside in an amdgpu/sched_helpers.c or similar?
> > > 
> > > I think that's better than adding 1 helper for just 1 driver and
> > > then supposedly removing it again in the future.
> > 
> > Yeah, agree to that general approach.
> > 
> > What amdgpu does here is kind of nasty and looks unnecessary, but
> > changing it means we need time from Hawkings and his people
> > involved on RAS for amdgpu.
> > 
> > When we move the code to the scheduler we make it official
> > scheduler interface for others to replicate, and that is exactly
> > what we should try to avoid.
> > 
> > So my suggestion is to add a /* TODO: This is nasty and should be
> > avoided */ to the amdgpu code instead.
> 
> So I got a no-go to export a low-level queue pop helper

The spsc_queue helper in patch 3 is totally alright. Patch 3 only
depends on patch 1 in the sense of it adding the new helper to the
cancel_all function of patch 1, or am I missing something obvious?

> , no-go to move the whole dodgy code to common (reasonable). Any
> third way to break the status quo? What if I respin with just a
> change local to amdgpu which would, instead of duplicating the
> to_drm_sched_job macro, duplicate __drm_sched_entity_queue_pop from
> 3/4 of this series?

I'm willing to take patch 3 if it's independent. That would then mean
that to_drm_sched_job() is only needed in amdgpu, wouldn't it? That's
independent from the cancel_all() function as far as the scheduler is
concerned.
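
To illustrate what I understand you'd duplicate — an amdgpu-local copy
of the pop helper would presumably look roughly like this (untested
sketch, function name made up by me, mirroring
__drm_sched_entity_queue_pop from 3/4):

static struct drm_sched_job *
amdgpu_sched_entity_queue_pop(struct drm_sched_entity *entity)
{
	struct spsc_node *node = spsc_queue_pop(&entity->job_queue);

	/* The job embeds the spsc node as its queue_node member. */
	return node ? container_of(node, struct drm_sched_job,
				   queue_node) : NULL;
}

With that, the amdgpu loop could do
while ((job = amdgpu_sched_entity_queue_pop(entity))) without needing
a local copy of the to_drm_sched_job() macro.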
P.

> 
> Regards,
> 
> Tvrtko
> 
> > Regards,
> > Christian.
> > 
> > > 
> > > P.
> > > 
> > > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx>
> > > > Cc: Christian König <christian.koenig@xxxxxxx>
> > > > Cc: Danilo Krummrich <dakr@xxxxxxxxxx>
> > > > Cc: Matthew Brost <matthew.brost@xxxxxxxxx>
> > > > Cc: Philipp Stanner <phasta@xxxxxxxxxx>
> > > > ---
> > > >  drivers/gpu/drm/scheduler/sched_main.c | 44 ++++++++++++++++++++++++++
> > > >  include/drm/gpu_scheduler.h            |  1 +
> > > >  2 files changed, 45 insertions(+)
> > > > 
> > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > index a48be16ab84f..0363655db22d 100644
> > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > @@ -703,6 +703,50 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
> > > >  }
> > > >  EXPORT_SYMBOL(drm_sched_start);
> > > >  
> > > > +/**
> > > > + * drm_sched_cancel_all_jobs - Cancel all queued and scheduled jobs
> > > > + *
> > > > + * @sched: scheduler instance
> > > > + * @errno: error value to set on signaled fences
> > > > + *
> > > > + * Signal all queued and scheduled jobs and set them to error state.
> > > > + *
> > > > + * Scheduler must be stopped before calling this.
> > > > + */
> > > > +void drm_sched_cancel_all_jobs(struct drm_gpu_scheduler *sched, int errno)
> > > > +{
> > > > +	struct drm_sched_entity *entity;
> > > > +	struct drm_sched_fence *s_fence;
> > > > +	struct drm_sched_job *job;
> > > > +	enum drm_sched_priority p;
> > > > +
> > > > +	drm_WARN_ON_ONCE(sched, !sched->pause_submit);
> > > > +
> > > > +	/* Signal all jobs not yet scheduled */
> > > > +	for (p = DRM_SCHED_PRIORITY_KERNEL; p < sched->num_rqs; p++) {
> > > > +		struct drm_sched_rq *rq = sched->sched_rq[p];
> > > > +
> > > > +		spin_lock(&rq->lock);
> > > > +		list_for_each_entry(entity, &rq->entities, list) {
> > > > +			while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
> > > > +				s_fence = job->s_fence;
> > > > +				dma_fence_signal(&s_fence->scheduled);
> > > > +				dma_fence_set_error(&s_fence->finished, errno);
> > > > +				dma_fence_signal(&s_fence->finished);
> > > > +			}
> > > > +		}
> > > > +		spin_unlock(&rq->lock);
> > > > +	}
> > > > +
> > > > +	/* Signal all jobs already scheduled to HW */
> > > > +	list_for_each_entry(job, &sched->pending_list, list) {
> > > > +		s_fence = job->s_fence;
> > > > +		dma_fence_set_error(&s_fence->finished, errno);
> > > > +		dma_fence_signal(&s_fence->finished);
> > > > +	}
> > > > +}
> > > > +EXPORT_SYMBOL(drm_sched_cancel_all_jobs);
> > > > +
> > > >  /**
> > > >   * drm_sched_resubmit_jobs - Deprecated, don't use in new code!
> > > >   *
> > > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > > index a0ff08123f07..298513f8c327 100644
> > > > --- a/include/drm/gpu_scheduler.h
> > > > +++ b/include/drm/gpu_scheduler.h
> > > > @@ -579,6 +579,7 @@ void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
> > > >  void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
> > > >  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
> > > >  void drm_sched_start(struct drm_gpu_scheduler *sched, int errno);
> > > > +void drm_sched_cancel_all_jobs(struct drm_gpu_scheduler *sched, int errno);
> > > >  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
> > > >  void drm_sched_increase_karma(struct drm_sched_job *bad);
> > > >  void drm_sched_reset_karma(struct drm_sched_job *bad);
> > 
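
P.S.: For anyone following along, my understanding of the intended
call sequence for the new helper would be roughly this (untested
sketch in a made-up function; -EHWPOISON is just an example errno, not
necessarily what amdgpu would pass):

static void example_wedge_scheduler(struct drm_gpu_scheduler *sched)
{
	/* Sets sched->pause_submit and cancels the submit work, which
	 * satisfies the drm_WARN_ON_ONCE() in the new helper. */
	drm_sched_wqueue_stop(sched);

	/* Signal and error out everything still queued or in flight. */
	drm_sched_cancel_all_jobs(sched, -EHWPOISON);
}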