On Thu, 2025-02-06 at 13:53 +0000, Tvrtko Ursulin wrote:
> 
> On 06/02/2025 13:46, Christian König wrote:
> > Am 06.02.25 um 14:35 schrieb Philipp Stanner:
> > > On Wed, 2025-02-05 at 15:33 +0000, Tvrtko Ursulin wrote:
> > > > The helper copies code from the existing
> > > > amdgpu_job_stop_all_jobs_on_sched with the purpose of reducing
> > > > the amount of driver code which directly touches scheduler
> > > > internals.
> > > > 
> > > > If or when amdgpu manages to change the approach for handling
> > > > the permanently wedged state this helper can be removed.
> > > 
> > > Have you checked how many other drivers might need such a helper?
> > > 
> > > I have somewhat mixed feelings about this, because, AFAICT, in
> > > the past helpers have been added for just 1 driver, such as
> > > drm_sched_wqueue_ready(), and then they have stayed for almost a
> > > decade.
> > > 
> > > AFAIU this is just a code move, and it only really "decouples"
> > > amdgpu in the sense of having an official scheduler function that
> > > does what amdgpu used to do.
> > > 
> > > So my tendency here would be to continue "allowing" amdgpu to
> > > touch the scheduler internals until amdgpu fixes this
> > > "permanently wedged state". And if that's too difficult, couldn't
> > > the helper reside in an amdgpu/sched_helpers.c or similar?
> > > 
> > > I think that's better than adding 1 helper for just 1 driver and
> > > then supposedly removing it again in the future.
> > 
> > Yeah, agree to that general approach.
> > 
> > What amdgpu does here is kind of nasty and looks unnecessary, but
> > changing it means we need time from Hawkings and his people
> > involved on RAS for amdgpu.
> > 
> > When we move the code to the scheduler we make it official
> > scheduler interface for others to replicate, and that is exactly
> > what we should try to avoid.
> > 
> > So my suggestion is to add a /* TODO: This is nasty and should be
> > avoided */ to the amdgpu code instead.
> 
> So I got a no-go to export a low-level queue pop helper

The spsc_queue helper in patch 3 is totally alright. Patch 3 only
depends on patch 1 in the sense of it adding the new helper to the
cancel_all function of patch 1, or am I missing something obvious?

> , no-go to move the whole dodgy code to common (reasonable). Any
> third way to break the status quo? What if I respin with just a
> change local to amdgpu which would, instead of duplicating the
> to_drm_sched_job macro, duplicate __drm_sched_entity_queue_pop from
> 3/4 of this series?

I'm willing to take patch 3 if it's independent. That would then mean
that to_drm_sched_job() is only needed in amdgpu, wouldn't it? That's
independent from the cancel_all() function as far as the scheduler is
concerned.
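
To illustrate what I understand you'd duplicate — an amdgpu-local copy
of the pop helper would presumably look roughly like this (untested
sketch, function name made up by me, mirroring
__drm_sched_entity_queue_pop from 3/4):

static struct drm_sched_job *
amdgpu_sched_entity_queue_pop(struct drm_sched_entity *entity)
{
	struct spsc_node *node = spsc_queue_pop(&entity->job_queue);

	/* The job embeds the spsc node as its queue_node member. */
	return node ? container_of(node, struct drm_sched_job,
				   queue_node) : NULL;
}

With that, the amdgpu loop could do
while ((job = amdgpu_sched_entity_queue_pop(entity))) without needing
a local copy of the to_drm_sched_job() macro.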
P.

> 
> Regards,
> 
> Tvrtko
> 
> > Regards,
> > Christian.
> > 
> > > 
> > > P.
> > > 
> > > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx>
> > > > Cc: Christian König <christian.koenig@xxxxxxx>
> > > > Cc: Danilo Krummrich <dakr@xxxxxxxxxx>
> > > > Cc: Matthew Brost <matthew.brost@xxxxxxxxx>
> > > > Cc: Philipp Stanner <phasta@xxxxxxxxxx>
> > > > ---
> > > >  drivers/gpu/drm/scheduler/sched_main.c | 44 ++++++++++++++++++++++++++
> > > >  include/drm/gpu_scheduler.h            |  1 +
> > > >  2 files changed, 45 insertions(+)
> > > > 
> > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > index a48be16ab84f..0363655db22d 100644
> > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > @@ -703,6 +703,50 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
> > > >  }
> > > >  EXPORT_SYMBOL(drm_sched_start);
> > > >  
> > > > +/**
> > > > + * drm_sched_cancel_all_jobs - Cancel all queued and scheduled jobs
> > > > + *
> > > > + * @sched: scheduler instance
> > > > + * @errno: error value to set on signaled fences
> > > > + *
> > > > + * Signal all queued and scheduled jobs and set them to error state.
> > > > + *
> > > > + * Scheduler must be stopped before calling this.
> > > > + */
> > > > +void drm_sched_cancel_all_jobs(struct drm_gpu_scheduler *sched, int errno)
> > > > +{
> > > > +	struct drm_sched_entity *entity;
> > > > +	struct drm_sched_fence *s_fence;
> > > > +	struct drm_sched_job *job;
> > > > +	enum drm_sched_priority p;
> > > > +
> > > > +	drm_WARN_ON_ONCE(sched, !sched->pause_submit);
> > > > +
> > > > +	/* Signal all jobs not yet scheduled */
> > > > +	for (p = DRM_SCHED_PRIORITY_KERNEL; p < sched->num_rqs; p++) {
> > > > +		struct drm_sched_rq *rq = sched->sched_rq[p];
> > > > +
> > > > +		spin_lock(&rq->lock);
> > > > +		list_for_each_entry(entity, &rq->entities, list) {
> > > > +			while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
> > > > +				s_fence = job->s_fence;
> > > > +				dma_fence_signal(&s_fence->scheduled);
> > > > +				dma_fence_set_error(&s_fence->finished, errno);
> > > > +				dma_fence_signal(&s_fence->finished);
> > > > +			}
> > > > +		}
> > > > +		spin_unlock(&rq->lock);
> > > > +	}
> > > > +
> > > > +	/* Signal all jobs already scheduled to HW */
> > > > +	list_for_each_entry(job, &sched->pending_list, list) {
> > > > +		s_fence = job->s_fence;
> > > > +		dma_fence_set_error(&s_fence->finished, errno);
> > > > +		dma_fence_signal(&s_fence->finished);
> > > > +	}
> > > > +}
> > > > +EXPORT_SYMBOL(drm_sched_cancel_all_jobs);
> > > > +
> > > >  /**
> > > >   * drm_sched_resubmit_jobs - Deprecated, don't use in new code!
> > > >   *
> > > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > > index a0ff08123f07..298513f8c327 100644
> > > > --- a/include/drm/gpu_scheduler.h
> > > > +++ b/include/drm/gpu_scheduler.h
> > > > @@ -579,6 +579,7 @@ void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
> > > >  void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
> > > >  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
> > > >  void drm_sched_start(struct drm_gpu_scheduler *sched, int errno);
> > > > +void drm_sched_cancel_all_jobs(struct drm_gpu_scheduler *sched, int errno);
> > > >  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
> > > >  void drm_sched_increase_karma(struct drm_sched_job *bad);
> > > >  void drm_sched_reset_karma(struct drm_sched_job *bad);
> > 
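
P.S.: For anyone following along, my understanding of the intended
call sequence for the new helper would be roughly this (untested
sketch in a made-up function; -EHWPOISON is just an example errno, not
necessarily what amdgpu would pass):

static void example_wedge_scheduler(struct drm_gpu_scheduler *sched)
{
	/* Sets sched->pause_submit and cancels the submit work, which
	 * satisfies the drm_WARN_ON_ONCE() in the new helper. */
	drm_sched_wqueue_stop(sched);

	/* Signal and error out everything still queued or in flight. */
	drm_sched_cancel_all_jobs(sched, -EHWPOISON);
}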