On Thu, 2025-02-06 at 14:46 +0100, Christian König wrote: > Am 06.02.25 um 14:35 schrieb Philipp Stanner: > > On Wed, 2025-02-05 at 15:33 +0000, Tvrtko Ursulin wrote: > > > The helper copies code from the existing > > > amdgpu_job_stop_all_jobs_on_sched > > > with the purpose of reducing the amount of driver code which > > > directly > > > touch scheduler internals. > > > > > > If or when amdgpu manages to change the approach for handling the > > > permanently wedged state this helper can be removed. > > Have you checked how many other drivers might need such a helper? > > > > I have a bit mixed feelings about this, because, AFAICT, in the > > past > > helpers have been added for just 1 driver, such as > > drm_sched_wqueue_ready(), and then they have stayed for almost a > > decade. > > > > AFAIU this is just code move, and only really "decouples" amdgpu in > > the > > sense of having an official scheduler function that does what > > amdgpu > > used to do. > > > > So my tendency here would be to continue "allowing" amdgpu to touch > > the > > scheduler internals until amdgpu fixes this "permanently wedged > > state". And if that's too difficult, couldn't the helper reside in > > a > > amdgpu/sched_helpers.c or similar? > > > > I think that's better than adding 1 helper for just 1 driver and > > then > > supposedly removing it again in the future. > > Yeah, agree to that general approach. > > What amdgpu does here is kind of nasty and looks unnecessary, but > changing it means we need time from Hawkings and his people involved > on > RAS for amdgpu. > > When we move the code to the scheduler we make it official scheduler > interface to others to replicate and that is exactly what we should > try > to avoid. Yes, I think if we all agree that the scheduler must only contain infrastructure useful for >= 2 DRM drivers' job queueing related tasks without any hacks for driver internal issues, that would be a great thing. P. > > So my suggestion is to add a /* TODO: This is nasty and should be > avoided */ to the amdgpu code instead. > > Regards, > Christian. > > > > > P. > > > > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx> > > > Cc: Christian König <christian.koenig@xxxxxxx> > > > Cc: Danilo Krummrich <dakr@xxxxxxxxxx> > > > Cc: Matthew Brost <matthew.brost@xxxxxxxxx> > > > Cc: Philipp Stanner <phasta@xxxxxxxxxx> > > > --- > > > drivers/gpu/drm/scheduler/sched_main.c | 44 > > > ++++++++++++++++++++++++++ > > > include/drm/gpu_scheduler.h | 1 + > > > 2 files changed, 45 insertions(+) > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c > > > b/drivers/gpu/drm/scheduler/sched_main.c > > > index a48be16ab84f..0363655db22d 100644 > > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > > @@ -703,6 +703,50 @@ void drm_sched_start(struct > > > drm_gpu_scheduler > > > *sched, int errno) > > > } > > > EXPORT_SYMBOL(drm_sched_start); > > > > > > +/** > > > + * drm_sched_cancel_all_jobs - Cancel all queued and scheduled > > > jobs > > > + * > > > + * @sched: scheduler instance > > > + * @errno: error value to set on signaled fences > > > + * > > > + * Signal all queued and scheduled jobs and set them to error > > > state. > > > + * > > > + * Scheduler must be stopped before calling this. > > > + */ > > > +void drm_sched_cancel_all_jobs(struct drm_gpu_scheduler *sched, > > > int > > > errno) > > > +{ > > > + struct drm_sched_entity *entity; > > > + struct drm_sched_fence *s_fence; > > > + struct drm_sched_job *job; > > > + enum drm_sched_priority p; > > > + > > > + drm_WARN_ON_ONCE(sched, !sched->pause_submit); > > > + > > > + /* Signal all jobs not yet scheduled */ > > > + for (p = DRM_SCHED_PRIORITY_KERNEL; p < sched->num_rqs; > > > p++) > > > { > > > + struct drm_sched_rq *rq = sched->sched_rq[p]; > > > + > > > + spin_lock(&rq->lock); > > > + list_for_each_entry(entity, &rq->entities, list) > > > { > > > + while ((job = > > > to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) { > > > + s_fence = job->s_fence; > > > + dma_fence_signal(&s_fence- > > > > scheduled); > > > + dma_fence_set_error(&s_fence- > > > > finished, errno); > > > + dma_fence_signal(&s_fence- > > > > finished); > > > + } > > > + } > > > + spin_unlock(&rq->lock); > > > + } > > > + > > > + /* Signal all jobs already scheduled to HW */ > > > + list_for_each_entry(job, &sched->pending_list, list) { > > > + s_fence = job->s_fence; > > > + dma_fence_set_error(&s_fence->finished, errno); > > > + dma_fence_signal(&s_fence->finished); > > > + } > > > +} > > > +EXPORT_SYMBOL(drm_sched_cancel_all_jobs); > > > + > > > /** > > > * drm_sched_resubmit_jobs - Deprecated, don't use in new code! > > > * > > > diff --git a/include/drm/gpu_scheduler.h > > > b/include/drm/gpu_scheduler.h > > > index a0ff08123f07..298513f8c327 100644 > > > --- a/include/drm/gpu_scheduler.h > > > +++ b/include/drm/gpu_scheduler.h > > > @@ -579,6 +579,7 @@ void drm_sched_wqueue_stop(struct > > > drm_gpu_scheduler *sched); > > > void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched); > > > void drm_sched_stop(struct drm_gpu_scheduler *sched, struct > > > drm_sched_job *bad); > > > void drm_sched_start(struct drm_gpu_scheduler *sched, int > > > errno); > > > +void drm_sched_cancel_all_jobs(struct drm_gpu_scheduler *sched, > > > int > > > errno); > > > void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched); > > > void drm_sched_increase_karma(struct drm_sched_job *bad); > > > void drm_sched_reset_karma(struct drm_sched_job *bad); >