On 2023-08-02 00:06, Matthew Brost wrote: > On Mon, Jul 17, 2023 at 01:40:38PM -0400, Luben Tuikov wrote: >> On 2023-07-16 03:51, Asahi Lina wrote: >>> On 15/07/2023 16.14, Luben Tuikov wrote: >>>> On 2023-07-14 04:21, Asahi Lina wrote: >>>>> drm_sched_fini() currently leaves any pending jobs dangling, which >>>>> causes segfaults and other badness when job completion fences are >>>>> signaled after the scheduler is torn down. >>>> >>>> If there are pending jobs, ideally we want to call into the driver, >>>> so that it can release resources it may be holding for those. >>>> The idea behind "pending" is that they are pending in the hardware >>>> and we don't know their state until signalled/the callback called. >>>> (Or unless the device is reset and we get a notification of that fact.) >>> >>> That's what the job->free_job() callback does, then the driver is free >>> to do whatever it wants with those jobs. A driver could opt to >>> synchronously kill those jobs (if it can) or account for them >>> separately/asynchronously. >>> >>> What this patch basically says is that if you destroy a scheduler with >>> pending jobs, it immediately considers them terminated with an error, >>> and returns ownership back to the driver for freeing. Then the driver >>> can decide how to handle the rest and whatever the underlying hardware >>> state is. >>> >>>>> Explicitly detach all jobs from their completion callbacks and free >>>>> them. This makes it possible to write a sensible safe abstraction for >>>>> drm_sched, without having to externally duplicate the tracking of >>>>> in-flight jobs. >>>>> >>>>> This shouldn't regress any existing drivers, since calling >>>>> drm_sched_fini() with any pending jobs is broken and this change should >>>>> be a no-op if there are no pending jobs. >>>> >>>> While this statement is true on its own, it kind of contradicts >>>> the premise of the first paragraph. >>> >>> I mean right *now* it's broken, before this patch. I'm trying to make it >>> safe, but it shouldn't regress any exiting drivers since if they trigger >>> this code path they are broken today. >> >> Not sure about other drivers--they can speak for themselves and the CC list >> should include them--please use "dim add-missing-cc" and make sure >> that the Git commit description contains the Cc tags--then git send-email >> will populate the SMTP CC. Feel free to add more Cc tags on top of that. >> > > Xe doesn't need this as our reference counting scheme doesn't allow > drm_sched_fini to be called when jobs are pending. If we want to > teardown a drm_sched we set TDR timeout to zero and all pending jobs > gets cleaned up that way, the ref of sched will go to zero, and > drm_sched_fini is called. The caveat here being I think we need a worker > to call drm_sched_fini as the last ref to scheduler might be dropped > from within scheduler main thread. > > That being said, I doubt this patch breaks anything in Xe so do not a > real strong opinion on this. Yes, that's my understanding as well. If the drivers call drm_sched_fini() then they are responsible for cleaning up _before_ calling drm_sched_fini(). The Xe driver seems to be doing the right thing. All in all, since drm_sched_fini() is called by the driver, the driver is supposed to have cleaned up before calling it, so I don't see much need for this patch after all. -- Regards, Luben