On Wed, 2024-09-04 at 19:47 +0200, Simona Vetter wrote:
> On Tue, Sep 03, 2024 at 11:44:47AM +0200, Philipp Stanner wrote:
> > The GPU scheduler currently does not ensure that its pending_list is
> > empty before performing various other teardown tasks in
> > drm_sched_fini().
> > 
> > If there are still jobs in the pending_list, this is problematic
> > because after scheduler teardown, no one will call
> > backend_ops.free_job() anymore. This would, consequently, result in
> > memory leaks.
> > 
> > One way to solve this is to implement a waitqueue that
> > drm_sched_fini() blocks on until the pending_list has become empty.
> > 
> > Add a waitqueue to struct drm_gpu_scheduler. Wake up waiters once the
> > pending_list becomes empty. Wait in drm_sched_fini() for that to
> > happen.
> > 
> > Suggested-by: Danilo Krummrich <dakr@xxxxxxxxxx>
> > Signed-off-by: Philipp Stanner <pstanner@xxxxxxxxxx>
> > ---
> > Hi all,
> > 
> > since the scheduler has many stakeholders, I want this solution
> > discussed as an RFC first.
> > 
> > This version here has IMO the advantage (and disadvantage...) that it
> > blocks infinitely if the driver messed up the cleanup, so problems
> > might become more visible than with the refcount solution I proposed
> > in parallel.
> 
> Very quick comment because I'm heading out for the r4l conference, but
> maybe I can discuss this there with Danilo a bit.
> 
> Maybe we should do step 0 first, and document the current rules? The
> kerneldoc isn't absolute zero anymore, but it's very, very bare-bones.
> Then get that acked and merged, which is a very good way to make sure
> we're actually standing on common ground.

Yes, documentation is definitely also on my TODO list. I wanted to send
out something clarifying the objects' lifetimes (based on Christian's
previous series [1]) quite soon.
> 
> Then maybe step 0.5 would be to add runtime asserts to enforce the
> rules, like if you tear down the scheduler and there's stuff in
> flight, you splat on a WARN_ON.
> 
> With that foundation there should be a lot clearer basis to discuss
> whether there is an issue, and what a better design could look like.

I mean, I'm quite sure that there are teardown issues. But we could
indeed make them visible first through documentation (and a FIXME tag)
and, after establishing consensus through merging that, go on as you
suggested.

> In the little pondering I've done I've come up with a few more ideas
> along similar lines.
> 
> One comment below, kinda unrelated.
> 
> > 
> > Cheers,
> > P.
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 40 ++++++++++++++++++++++++++
> >  include/drm/gpu_scheduler.h            |  4 +++
> >  2 files changed, 44 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 7e90c9f95611..200fa932f289 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -564,6 +564,13 @@ static void drm_sched_job_timedout(struct work_struct *work)
> >  	 * is parked at which point it's safe.
> >  	 */
> >  	list_del_init(&job->list);
> > +
> > +	/*
> > +	 * Inform tasks blocking in drm_sched_fini() that it's now safe to proceed.
> > +	 */
> > +	if (list_empty(&sched->pending_list))
> > +		wake_up(&sched->job_list_waitque);
> > +
> >  	spin_unlock(&sched->job_list_lock);
> > 
> >  	status = job->sched->ops->timedout_job(job);
> > @@ -584,6 +591,15 @@ static void drm_sched_job_timedout(struct work_struct *work)
> >  		drm_sched_start_timeout_unlocked(sched);
> >  }
> > 
> > +static bool drm_sched_no_jobs_pending(struct drm_gpu_scheduler *sched)
> > +{
> > +	/*
> > +	 * For list_empty() to work without a lock.
> > +	 */
> 
> So this is pretty far from the gold standard for documenting memory
> barrier semantics in lockless code.
> Ideally we have a comment for both sides of the barrier (you always
> need two, or there's no functioning barrier), pointing at each other,
> and explaining exactly what's being synchronized against what and how.
> 
> I did years ago add a few missing barriers with that approach, see
> b0a5303d4e14 ("drm/sched: Barriers are needed for
> entity->last_scheduled"). Reading that patch again it feels a bit on
> the terse side of things (plus I noticed spelling issues now too,
> oops).

ACK, will do that properly once we roll out the actual patch.

P.

> 
> Cheers, Sima
> 
> > +	rmb();
> > +	return list_empty(&sched->pending_list);
> > +}
> > +
> >  /**
> >   * drm_sched_stop - stop the scheduler
> >   *
> > @@ -659,6 +675,12 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> >  		}
> >  	}
> > 
> > +	/*
> > +	 * Inform tasks blocking in drm_sched_fini() that it's now safe to proceed.
> > +	 */
> > +	if (drm_sched_no_jobs_pending(sched))
> > +		wake_up(&sched->job_list_waitque);
> > +
> >  	/*
> >  	 * Stop pending timer in flight as we rearm it in drm_sched_start. This
> >  	 * avoids the pending timeout work in progress to fire right away after
> > @@ -1085,6 +1107,12 @@ drm_sched_get_finished_job(struct drm_gpu_scheduler *sched)
> >  		/* remove job from pending_list */
> >  		list_del_init(&job->list);
> > 
> > +		/*
> > +		 * Inform tasks blocking in drm_sched_fini() that it's now safe to proceed.
> > +		 */
> > +		if (list_empty(&sched->pending_list))
> > +			wake_up(&sched->job_list_waitque);
> > +
> >  		/* cancel this job's TO timer */
> >  		cancel_delayed_work(&sched->work_tdr);
> >  		/* make the scheduled timestamp more accurate */
> > @@ -1303,6 +1331,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> >  	init_waitqueue_head(&sched->job_scheduled);
> >  	INIT_LIST_HEAD(&sched->pending_list);
> >  	spin_lock_init(&sched->job_list_lock);
> > +	init_waitqueue_head(&sched->job_list_waitque);
> >  	atomic_set(&sched->credit_count, 0);
> >  	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> >  	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
> > @@ -1333,12 +1362,23 @@ EXPORT_SYMBOL(drm_sched_init);
> >   * @sched: scheduler instance
> >   *
> >   * Tears down and cleans up the scheduler.
> > + *
> > + * Note that this function blocks until the fences returned by
> > + * backend_ops.run_job() have been signalled.
> >   */
> >  void drm_sched_fini(struct drm_gpu_scheduler *sched)
> >  {
> >  	struct drm_sched_entity *s_entity;
> >  	int i;
> > 
> > +	/*
> > +	 * Jobs that have neither been scheduled nor timed out are
> > +	 * gone by now, but jobs that have been submitted through
> > +	 * backend_ops.run_job() and have not yet terminated are
> > +	 * still pending.
> > +	 *
> > +	 * Wait for the pending_list to become empty to avoid
> > +	 * leaking those jobs.
> > +	 */
> > +	wait_event(sched->job_list_waitque, drm_sched_no_jobs_pending(sched));
> > +
> >  	drm_sched_wqueue_stop(sched);
> > 
> >  	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index 5acc64954a88..bff092784405 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -29,6 +29,7 @@
> >  #include <linux/completion.h>
> >  #include <linux/xarray.h>
> >  #include <linux/workqueue.h>
> > +#include <linux/wait.h>
> > 
> >  #define MAX_WAIT_SCHED_ENTITY_Q_EMPTY msecs_to_jiffies(1000)
> > 
> > @@ -503,6 +504,8 @@ struct drm_sched_backend_ops {
> >   *              timeout interval is over.
> >   * @pending_list: the list of jobs which are currently in the job queue.
> >   * @job_list_lock: lock to protect the pending_list.
> > + * @job_list_waitque: a waitqueue for drm_sched_fini() to block on until all
> > + *                    pending jobs have been finished.
> >   * @hang_limit: once the hangs by a job crosses this limit then it is marked
> >   *              guilty and it will no longer be considered for
> >   *              scheduling.
> >   * @score: score to help loadbalancer pick a idle sched
> > @@ -532,6 +535,7 @@ struct drm_gpu_scheduler {
> >  	struct delayed_work work_tdr;
> >  	struct list_head pending_list;
> >  	spinlock_t job_list_lock;
> > +	wait_queue_head_t job_list_waitque;
> >  	int hang_limit;
> >  	atomic_t *score;
> >  	atomic_t _score;
> > --
> > 2.46.0
> > 
> 