On 2023-07-19 04:45, Christian König wrote:
> On 16.07.23 at 09:51, Asahi Lina wrote:
>> On 15/07/2023 16.14, Luben Tuikov wrote:
>>> On 2023-07-14 04:21, Asahi Lina wrote:
>>>> drm_sched_fini() currently leaves any pending jobs dangling, which
>>>> causes segfaults and other badness when job completion fences are
>>>> signaled after the scheduler is torn down.
>>>
>>> If there are pending jobs, ideally we want to call into the driver,
>>> so that it can release resources it may be holding for those.
>>> The idea behind "pending" is that they are pending in the hardware
>>> and we don't know their state until signalled/the callback is called.
>>> (Or unless the device is reset and we get a notification of that fact.)
>>
>> That's what the job->free_job() callback does; the driver is then free
>> to do whatever it wants with those jobs. A driver could opt to
>> synchronously kill those jobs (if it can) or account for them
>> separately/asynchronously.
>>
>> What this patch basically says is that if you destroy a scheduler with
>> pending jobs, it immediately considers them terminated with an error,
>> and returns ownership back to the driver for freeing. Then the driver
>> can decide how to handle the rest and whatever the underlying hardware
>> state is.
>
> Yeah, and exactly that is absolutely *not* a good idea. Keep in mind
> that memory management depends on all this stuff, and signaling a
> dma_fence always requires that the hw has given the go-ahead for it.
>
> If you want to clean up a scheduler with pending jobs, what needs to
> happen instead is that the driver cancels the processing and signals
> the hw fence.
>
>>
>>>> Explicitly detach all jobs from their completion callbacks and free
>>>> them. This makes it possible to write a sensible safe abstraction for
>>>> drm_sched, without having to externally duplicate the tracking of
>>>> in-flight jobs.
>>>>
>>>> This shouldn't regress any existing drivers, since calling
>>>> drm_sched_fini() with any pending jobs is broken and this change should
>>>> be a no-op if there are no pending jobs.
>>>
>>> While this statement is true on its own, it kind of contradicts
>>> the premise of the first paragraph.
>>
>> I mean right *now* it's broken, before this patch. I'm trying to make
>> it safe, but it shouldn't regress any existing drivers, since if they
>> trigger this code path they are broken today.
>
> Yes, and exactly that is intentional.
>
> What you can do is issue a *big* warning here when there are still
> pending unsignaled hw fences when the driver calls drm_sched_fini().
>
> But setting the scheduler fence to signaled without getting a signaled
> state from the hw fence is a complete NO-GO.

Okay, so we have the requirement (how). If we can also get a reason
behind it (why), perhaps we can add the requirement and the reason as a
lucid comment to drm_sched_fini(), to come with this patch when it is
reworked, so that future drivers, whether they be in Rust or C, can take
note. Perhaps this will also help future development in DRM itself.
--
Regards,
Luben

>
> Regards,
> Christian.
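As a strawman for the lucid comment Luben asks for, the requirement and the
reason could be spelled out right on drm_sched_fini(), together with the big
warning Christian suggests. A rough sketch only (wording and placement are
illustrative, not part of the posted patch):

/**
 * drm_sched_fini - tear down a GPU scheduler
 * @sched: scheduler instance
 *
 * The driver must stop submission and cancel or wait out all pending
 * jobs, signaling their hardware fences, before calling this function.
 * Memory management relies on scheduler fences only being signaled once
 * the hardware has actually completed the work, so the scheduler must
 * never invent a signaled state for them during teardown.
 */
void drm_sched_fini(struct drm_gpu_scheduler *sched)
{
	/*
	 * Loudly flag drivers that tear down a scheduler while jobs with
	 * unsignaled hardware fences are still pending.
	 */
	WARN_ON(!list_empty(&sched->pending_list));

	if (sched->thread)
		kthread_stop(sched->thread);
	/* ... existing teardown continues here ... */
}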
>
>>
>>>
>>>> Signed-off-by: Asahi Lina <lina@xxxxxxxxxxxxx>
>>>> ---
>>>>  drivers/gpu/drm/scheduler/sched_main.c | 32 ++++++++++++++++++++++++++++++--
>>>>  1 file changed, 30 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 1f3bc3606239..a4da4aac0efd 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>  void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>  {
>>>>  	struct drm_sched_entity *s_entity;
>>>> +	struct drm_sched_job *s_job, *tmp;
>>>>  	int i;
>>>>
>>>> -	if (sched->thread)
>>>> -		kthread_stop(sched->thread);
>>>> +	if (!sched->thread)
>>>> +		return;
>>>> +
>>>> +	/*
>>>> +	 * Stop the scheduler, detaching all jobs from their hardware callbacks
>>>> +	 * and cleaning up complete jobs.
>>>> +	 */
>>>> +	drm_sched_stop(sched, NULL);
>>>> +
>>>> +	/*
>>>> +	 * Iterate through the pending job list and free all jobs.
>>>> +	 * This assumes the driver has either guaranteed jobs are already stopped, or that
>>>> +	 * otherwise it is responsible for keeping any necessary data structures for
>>>> +	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
>>>> +	 * putting them in its own queue or doing its own refcounting).
>>>> +	 */
>>>> +	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>>>> +		spin_lock(&sched->job_list_lock);
>>>> +		list_del_init(&s_job->list);
>>>> +		spin_unlock(&sched->job_list_lock);
>>>> +
>>>> +		dma_fence_set_error(&s_job->s_fence->finished, -ESRCH);
>>>> +		drm_sched_fence_finished(s_job->s_fence);
>>>
>>> I'd imagine it's better to rebase this on top of drm-misc-next, where
>>> drm_sched_fence_finished() takes one more parameter--the error.
>>
>> Ah, sure! I can do that.
>>
>>>
>>>> +
>>>> +		WARN_ON(s_job->s_fence->parent);
>>>> +		sched->ops->free_job(s_job);
>>>> +	}
>>>> +
>>>> +	kthread_stop(sched->thread);
>>>>
>>>>  	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>>>  		struct drm_sched_rq *rq = &sched->sched_rq[i];
>>>>
>>>
>>> Conceptually I don't mind this patch--I see what it is trying to achieve,
>>> but technically, we want the driver to detect GPU removal and return shared
>>> resources back, such as "jobs", which DRM is also aware of.
>>
>> I think you missed the context of why I'm doing this, so in short: my
>> use case (like Xe's) involves using a separate drm_sched instance *per
>> file*, since these queues are scheduled directly by the firmware. So
>> this isn't about GPU removal, but rather about a GPU context going
>> away while jobs are in flight (e.g. the process got killed). We want
>> that to quickly kill the "DRM view" of the world, including signaling
>> all the fences with an error and freeing resources like the scheduler
>> itself.
>>
>> In the case of this particular GPU, there is no known way to actively
>> and synchronously abort GPU jobs, so we need to let them run to
>> completion (or failure), but we don't want that to block process
>> cleanup and freeing a bunch of high-level resources. The driver is
>> architected roughly along the lines of a firmware abstraction layer
>> that maps to the firmware shared memory structures, and then a layer
>> on top that implements the DRM view.
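To picture the per-file model, here is a minimal sketch of teardown driven
from the file-close path (hypothetical structure and function names in C;
the actual driver implements this through its Rust abstractions):

#include <drm/drm_device.h>
#include <drm/drm_file.h>
#include <drm/gpu_scheduler.h>
#include <linux/slab.h>

/* Hypothetical per-file state: one firmware-scheduled queue per file. */
struct myqueue_file {
	struct drm_gpu_scheduler sched;
	struct drm_sched_entity entity;
};

static void mydrv_postclose(struct drm_device *dev, struct drm_file *file)
{
	struct myqueue_file *qf = file->driver_priv;

	/* Stop accepting new jobs from this context. */
	drm_sched_entity_destroy(&qf->entity);

	/*
	 * Tear down the per-file scheduler. With the proposed patch, any
	 * still-pending jobs get their finished fences set to -ESRCH and
	 * are handed back through ops->free_job(); the firmware-side
	 * objects stay alive via their own refcounting until the hardware
	 * actually finishes.
	 */
	drm_sched_fini(&qf->sched);

	kfree(qf);
}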
>> When a process gets killed, the DRM side (which includes the
>> scheduler, etc.) gets torn down immediately, and it makes sense to
>> handle this cleanup inside drm_sched, since it already has a view into
>> what jobs are in flight. Otherwise, I would have to duplicate job
>> tracking in the driver (actually worse: in the Rust abstraction, for
>> safety), which doesn't make much sense.
>>
>> But what I *do* have in the driver is tracking of the firmware
>> structures. So when the drm_sched gets torn down and all the jobs are
>> killed, the underlying firmware jobs do run to completion, and the
>> resources they use are all cleaned up after that (it's all reference
>> counted). The primitive involved here is that in-flight firmware jobs
>> are assigned an event completion slot, and that keeps a reference to
>> them from a global array until the events fire and all the jobs are
>> known to have completed. This keeps things memory-safe, since we
>> absolutely cannot free/destroy firmware structures while they are in
>> use (otherwise the firmware crashes, which is fatal on these GPUs -
>> it requires a full system reboot to recover).
>>
>> In practice, with the VM map model that we use, what ends up happening
>> when a process gets killed is that all the user objects for in-flight
>> jobs get unmapped, which usually causes the GPU hardware (not
>> firmware) to fault. This then triggers early termination of jobs
>> anyway via the firmware fault recovery flow. But even that takes some
>> short amount of time, and by then all the drm_sched stuff is long gone
>> and we're just dealing with the in-flight firmware stuff.
>>
>>> In the case where we're initiating the tear-down, we should notify
>>> the driver that we're about to forget jobs (resources), so that it
>>> knows to return them back, or that it shouldn't notify us for them
>>> (since we've notified that we're forgetting them).
>>
>> That contradicts Christian's comment. I tried to document that (after
>> this patch) the scheduler no longer cares about hw fences and whether
>> they are signaled or not after it's destroyed, and I got a strongly
>> worded NAK for it. Sooo... which is it? Is it okay for drivers not to
>> signal the hw fence after a scheduler teardown, or not?
>>
>> But really, I don't see a use case for an explicit "about to forget
>> job" callback. The job free callback already serves the purpose of
>> telling the driver to clean up resources associated with a job. If it
>> wants to synchronously abort things there, it could easily take over
>> its own fence signaling and do something with the underlying stuff if
>> the fence is not signaled yet.
>>
>> In my case, since the driver is written in Rust and free_job() just
>> maps to the destructor (Drop impl), that just ends up freeing a bunch
>> of memory and other objects, and I don't particularly care about the
>> state of the firmware side any more after that. The flow is the same
>> whether it was a successful job completion, a failure, or an early
>> destruction due to the drm_sched getting torn down.
>>
>>> (Note also that, in this latter case, traditionally the device would
>>> be reset, so that we can guarantee that it has forgotten all shared
>>> resources which we are to tear down. This is somewhat more
>>> complicated with GPUs, thus the method pointed out above.)
>>
>> Yeah, in the firmware scheduling case we can't do this at all unless
>> the firmware has an explicit teardown/forget op (which I'm not aware
>> of), and a full GPU reset isn't something we can do either. Hence we
>> just let the underlying jobs complete. In practice they tend to die
>> pretty quickly anyway once all the buffers are unmapped.
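The event-slot bookkeeping described above boils down to parking one
reference per in-flight firmware job until its completion event fires.
A rough C sketch of the idea (all names hypothetical; the actual driver
does this in Rust with its own types):

#include <linux/kref.h>
#include <linux/slab.h>

#define NUM_EVENT_SLOTS 128

/* Hypothetical refcounted wrapper around the firmware shared-memory
 * structures for one in-flight job. */
struct fw_job {
	struct kref ref;
	/* ... descriptors the firmware may still be reading ... */
};

static struct fw_job *event_slots[NUM_EVENT_SLOTS];

static void fw_job_release(struct kref *ref)
{
	/* Only reached once no one (slot, driver, etc.) holds a reference,
	 * i.e. the firmware is guaranteed to be done with the structures. */
	kfree(container_of(ref, struct fw_job, ref));
}

/* Submission parks a reference in the job's completion slot, so the
 * firmware structures outlive any drm_sched/DRM-side teardown. */
static void fw_job_submit(struct fw_job *job, unsigned int slot)
{
	kref_get(&job->ref);
	event_slots[slot] = job;
}

/* The completion event drops the slot's reference; the structures are
 * freed only after every other holder has also dropped theirs. */
static void fw_event_fired(unsigned int slot)
{
	struct fw_job *job = event_slots[slot];

	event_slots[slot] = NULL;
	kref_put(&job->ref, fw_job_release);
}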
>>
>> ~~ Lina
>>
>