Re: [PATCH 1/4] drm/scheduler: Add drm_sched_cancel_all_jobs helper

Danilo Krummrich <dakr@xxxxxxxxxx> · Thu, 6 Feb 2025 15:25:07 +0100

On Thu, Feb 06, 2025 at 02:46:40PM +0100, Christian König wrote:
> Am 06.02.25 um 14:35 schrieb Philipp Stanner:
> > On Wed, 2025-02-05 at 15:33 +0000, Tvrtko Ursulin wrote:
> > > The helper copies code from the existing
> > > amdgpu_job_stop_all_jobs_on_sched
> > > with the purpose of reducing the amount of driver code which directly
> > > touch scheduler internals.
> > > 
> > > If or when amdgpu manages to change the approach for handling the
> > > permanently wedged state this helper can be removed.
> > Have you checked how many other drivers might need such a helper?
> > 
> > I have a bit mixed feelings about this, because, AFAICT, in the past
> > helpers have been added for just 1 driver, such as
> > drm_sched_wqueue_ready(), and then they have stayed for almost a
> > decade.
> > 
> > AFAIU this is just code move, and only really "decouples" amdgpu in the
> > sense of having an official scheduler function that does what amdgpu
> > used to do.
> > 
> > So my tendency here would be to continue "allowing" amdgpu to touch the
> > scheduler internals until amdgpu fixes this "permanently wedged
> > state". And if that's too difficult, couldn't the helper reside in a
> > amdgpu/sched_helpers.c or similar?
> > 
> > I think that's better than adding 1 helper for just 1 driver and then
> > supposedly removing it again in the future.
> 
> Yeah, agree to that general approach.
> 
> What amdgpu does here is kind of nasty and looks unnecessary, but changing
> it means we need time from Hawkings and his people involved on RAS for
> amdgpu.
> 
> When we move the code to the scheduler we make it official scheduler
> interface to others to replicate and that is exactly what we should try to
> avoid.

It'd be even worse. It would mean that we create an example for drivers to be
"rewarded" with a free driver specific API by abusing the general API.

Clearly (also) a NACK from my end.