On Thu, Feb 06, 2025 at 02:46:40PM +0100, Christian König wrote: > Am 06.02.25 um 14:35 schrieb Philipp Stanner: > > On Wed, 2025-02-05 at 15:33 +0000, Tvrtko Ursulin wrote: > > > The helper copies code from the existing > > > amdgpu_job_stop_all_jobs_on_sched > > > with the purpose of reducing the amount of driver code which directly > > > touch scheduler internals. > > > > > > If or when amdgpu manages to change the approach for handling the > > > permanently wedged state this helper can be removed. > > Have you checked how many other drivers might need such a helper? > > > > I have a bit mixed feelings about this, because, AFAICT, in the past > > helpers have been added for just 1 driver, such as > > drm_sched_wqueue_ready(), and then they have stayed for almost a > > decade. > > > > AFAIU this is just code move, and only really "decouples" amdgpu in the > > sense of having an official scheduler function that does what amdgpu > > used to do. > > > > So my tendency here would be to continue "allowing" amdgpu to touch the > > scheduler internals until amdgpu fixes this "permanently wedged > > state". And if that's too difficult, couldn't the helper reside in a > > amdgpu/sched_helpers.c or similar? > > > > I think that's better than adding 1 helper for just 1 driver and then > > supposedly removing it again in the future. > > Yeah, agree to that general approach. > > What amdgpu does here is kind of nasty and looks unnecessary, but changing > it means we need time from Hawkings and his people involved on RAS for > amdgpu. > > When we move the code to the scheduler we make it official scheduler > interface to others to replicate and that is exactly what we should try to > avoid. It'd be even worse. It would mean that we create an example for drivers to be "rewarded" with a free driver specific API by abusing the general API. Clearly (also) a NACK from my end.