Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item

Matthew Brost <matthew.brost@xxxxxxxxx> · Wed, 23 Aug 2023 03:27:19 +0000

On Mon, Aug 21, 2023 at 03:17:29PM +0200, Christian König wrote:
> Am 18.08.23 um 15:13 schrieb Matthew Brost:
> > On Fri, Aug 18, 2023 at 07:27:33AM +0200, Christian König wrote:
> > > Am 17.08.23 um 19:54 schrieb Matthew Brost:
> > > > On Thu, Aug 17, 2023 at 03:39:40PM +0200, Christian König wrote:
> > > > > Am 11.08.23 um 04:31 schrieb Matthew Brost:
> > > > > > Rather than call free_job and run_job in same work item have a dedicated
> > > > > > work item for each. This aligns with the design and intended use of work
> > > > > > queues.
> > > > > I would rather say we should get completely rid of the free_job callback.
> > > > > 
> > > > Would we still have work item? e.g. Would we still want to call
> > > > drm_sched_get_cleanup_job which removes the job from the pending list
> > > > and adjusts the TDR? Trying to figure out out what this looks like. We
> > > > probably can't do all of this from an IRQ context.
> > > > 
> > > > > Essentially the job is just the container which carries the information
> > > > > which are necessary before you push it to the hw. The real representation of
> > > > > the submission is actually the scheduler fence.
> > > > > 
> > > > Most of the free_jobs call plus drm_sched_job_cleanup + a put on job. In
> > > > Xe this cannot be called from an IRQ context either.
> > > > 
> > > > I'm just confused what exactly you are suggesting here.
> > > To summarize on one sentence: Instead of the job we keep the scheduler and
> > > hardware fences around after pushing the job to the hw.
> > > 
> > > The free_job callback would then be replaced by dropping the reference on
> > > the scheduler and hw fence.
> > > 
> > > Would that work for you?
> > > 
> > I don't think so for a few reasons.
> > 
> > The job and hw fence are different structures (also different allocs too)
> > for a reason. The job referenced until it is complete (hw fence is
> > signaled) and the free_job is called. This reference is needed for the
> > TDR to work properly and also some reset flows too.
> 
> That is exactly what I want to avoid, tying the TDR to the job is what some
> AMD engineers pushed for because it looked like a simple solution and made
> the whole thing similar to what Windows does.
> 
> This turned the previous relatively clean scheduler and TDR design into a
> complete nightmare. The job contains quite a bunch of things which are not
> necessarily available after the application which submitted the job is torn
> down.
>

Agree the TDR shouldn't be accessing anything application specific
rather just internal job state required to tear the job down on the
hardware.

> So what happens is that you either have stale pointers in the TDR which can
> go boom extremely easily or we somehow find a way to keep the necessary

I have not experenced the TDR going boom in Xe.

> structures (which include struct thread_info and struct file for this driver
> connection) alive until all submissions are completed.
> 

In Xe we keep everything alive until all submissions are completed. By
everything I mean the drm job, entity, scheduler, and VM via a reference
counting scheme. All of these structures are just kernel state which can
safely be accessed even if the application has been killed.

If we need to teardown on demand we just set the TDR to a minimum value and
it kicks the jobs off the hardware, gracefully cleans everything up and
drops all references. This is a benefit of the 1 to 1 relationship, not
sure if this works with how AMDGPU uses the scheduler.

> Delaying application tear down is also not an option because then you run
> into massive trouble with the OOM killer (or more generally OOM handling).
> See what we do in drm_sched_entity_flush() as well.
> 

Not an issue for Xe, we never call drm_sched_entity_flush as our
referencing counting scheme is all jobs are finished before we attempt
to tear down entity / scheduler.

> Since adding the TDR support we completely exercised this through in the
> last two or three years or so. And to sum it up I would really like to get
> away from this mess again.
> 
> Compared to that what i915 does is actually rather clean I think.
> 

Not even close, resets where a nightmare in the i915 (I spend years
trying to get this right and probably still completely work) and in Xe
basically got it right on the attempt.

> >   Also in Xe some of
> > things done in free_job cannot be from an IRQ context, hence calling
> > this from the scheduler worker is rather helpful.
> 
> Well putting things for cleanup into a workitem doesn't sounds like
> something hard.
>

That is exactly what we doing in the scheduler with the free_job
workitem.

> Question is what do you really need for TDR which is not inside the hardware
> fence?
>

A reference to the entity to be able to kick the job off the hardware.
A reference to the entity, job, and VM for error capture.

We also need a reference to the job for recovery after a GPU reset so
run_job can be called again for innocent jobs.

All of this leads to believe we need to stick with the design.

Matt

> Regards,
> Christian.
> 
> > 
> > The HW fence can live for longer as it can be installed in dma-resv
> > slots, syncobjs, etc... If the job and hw fence are combined now we
> > holding on the memory for the longer and perhaps at the mercy of the
> > user. We also run the risk of the final put being done from an IRQ
> > context which again wont work in Xe as it is currently coded. Lastly 2
> > jobs from the same scheduler could do the final put in parallel, so
> > rather than having free_job serialized by the worker now multiple jobs
> > are freeing themselves at the same time. This might not be an issue but
> > adds another level of raceyness that needs to be accounted for. None of
> > this sounds desirable to me.
> > 
> > FWIW what you suggesting sounds like how the i915 did things
> > (i915_request and hw fence in 1 memory alloc) and that turned out to be
> > a huge mess. As rule of thumb I generally do the opposite of whatever
> > the i915 did.
> > 
> > Matt
> > 
> > > Christian.
> > > 
> > > > Matt
> > > > 
> > > > > All the lifetime issues we had came from ignoring this fact and I think we
> > > > > should push for fixing this design up again.
> > > > > 
> > > > > Regards,
> > > > > Christian.
> > > > > 
> > > > > > Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> > > > > > ---
> > > > > >     drivers/gpu/drm/scheduler/sched_main.c | 137 ++++++++++++++++++-------
> > > > > >     include/drm/gpu_scheduler.h            |   8 +-
> > > > > >     2 files changed, 106 insertions(+), 39 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > index cede47afc800..b67469eac179 100644
> > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > @@ -213,11 +213,12 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
> > > > > >      * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
> > > > > >      *
> > > > > >      * @rq: scheduler run queue to check.
> > > > > > + * @dequeue: dequeue selected entity
> > > > > >      *
> > > > > >      * Try to find a ready entity, returns NULL if none found.
> > > > > >      */
> > > > > >     static struct drm_sched_entity *
> > > > > > -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > > +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq, bool dequeue)
> > > > > >     {
> > > > > >     	struct drm_sched_entity *entity;
> > > > > > @@ -227,8 +228,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > >     	if (entity) {
> > > > > >     		list_for_each_entry_continue(entity, &rq->entities, list) {
> > > > > >     			if (drm_sched_entity_is_ready(entity)) {
> > > > > > -				rq->current_entity = entity;
> > > > > > -				reinit_completion(&entity->entity_idle);
> > > > > > +				if (dequeue) {
> > > > > > +					rq->current_entity = entity;
> > > > > > +					reinit_completion(&entity->entity_idle);
> > > > > > +				}
> > > > > >     				spin_unlock(&rq->lock);
> > > > > >     				return entity;
> > > > > >     			}
> > > > > > @@ -238,8 +241,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > >     	list_for_each_entry(entity, &rq->entities, list) {
> > > > > >     		if (drm_sched_entity_is_ready(entity)) {
> > > > > > -			rq->current_entity = entity;
> > > > > > -			reinit_completion(&entity->entity_idle);
> > > > > > +			if (dequeue) {
> > > > > > +				rq->current_entity = entity;
> > > > > > +				reinit_completion(&entity->entity_idle);
> > > > > > +			}
> > > > > >     			spin_unlock(&rq->lock);
> > > > > >     			return entity;
> > > > > >     		}
> > > > > > @@ -257,11 +262,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > >      * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
> > > > > >      *
> > > > > >      * @rq: scheduler run queue to check.
> > > > > > + * @dequeue: dequeue selected entity
> > > > > >      *
> > > > > >      * Find oldest waiting ready entity, returns NULL if none found.
> > > > > >      */
> > > > > >     static struct drm_sched_entity *
> > > > > > -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > > > > +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq, bool dequeue)
> > > > > >     {
> > > > > >     	struct rb_node *rb;
> > > > > > @@ -271,8 +277,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > > > >     		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
> > > > > >     		if (drm_sched_entity_is_ready(entity)) {
> > > > > > -			rq->current_entity = entity;
> > > > > > -			reinit_completion(&entity->entity_idle);
> > > > > > +			if (dequeue) {
> > > > > > +				rq->current_entity = entity;
> > > > > > +				reinit_completion(&entity->entity_idle);
> > > > > > +			}
> > > > > >     			break;
> > > > > >     		}
> > > > > >     	}
> > > > > > @@ -282,13 +290,54 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > > > >     }
> > > > > >     /**
> > > > > > - * drm_sched_submit_queue - scheduler queue submission
> > > > > > + * drm_sched_run_job_queue - queue job submission
> > > > > >      * @sched: scheduler instance
> > > > > >      */
> > > > > > -static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
> > > > > > +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
> > > > > >     {
> > > > > >     	if (!READ_ONCE(sched->pause_submit))
> > > > > > -		queue_work(sched->submit_wq, &sched->work_submit);
> > > > > > +		queue_work(sched->submit_wq, &sched->work_run_job);
> > > > > > +}
> > > > > > +
> > > > > > +static struct drm_sched_entity *
> > > > > > +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue);
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_sched_run_job_queue_if_ready - queue job submission if ready
> > > > > > + * @sched: scheduler instance
> > > > > > + */
> > > > > > +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > > > > > +{
> > > > > > +	if (drm_sched_select_entity(sched, false))
> > > > > > +		drm_sched_run_job_queue(sched);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_sched_free_job_queue - queue free job
> > > > > > + *
> > > > > > + * @sched: scheduler instance to queue free job
> > > > > > + */
> > > > > > +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
> > > > > > +{
> > > > > > +	if (!READ_ONCE(sched->pause_submit))
> > > > > > +		queue_work(sched->submit_wq, &sched->work_free_job);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_sched_free_job_queue_if_ready - queue free job if ready
> > > > > > + *
> > > > > > + * @sched: scheduler instance to queue free job
> > > > > > + */
> > > > > > +static void drm_sched_free_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > > > > > +{
> > > > > > +	struct drm_sched_job *job;
> > > > > > +
> > > > > > +	spin_lock(&sched->job_list_lock);
> > > > > > +	job = list_first_entry_or_null(&sched->pending_list,
> > > > > > +				       struct drm_sched_job, list);
> > > > > > +	if (job && dma_fence_is_signaled(&job->s_fence->finished))
> > > > > > +		drm_sched_free_job_queue(sched);
> > > > > > +	spin_unlock(&sched->job_list_lock);
> > > > > >     }
> > > > > >     /**
> > > > > > @@ -310,7 +359,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
> > > > > >     	dma_fence_get(&s_fence->finished);
> > > > > >     	drm_sched_fence_finished(s_fence, result);
> > > > > >     	dma_fence_put(&s_fence->finished);
> > > > > > -	drm_sched_submit_queue(sched);
> > > > > > +	drm_sched_free_job_queue(sched);
> > > > > >     }
> > > > > >     /**
> > > > > > @@ -906,18 +955,19 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
> > > > > >     void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> > > > > >     {
> > > > > >     	if (drm_sched_can_queue(sched))
> > > > > > -		drm_sched_submit_queue(sched);
> > > > > > +		drm_sched_run_job_queue(sched);
> > > > > >     }
> > > > > >     /**
> > > > > >      * drm_sched_select_entity - Select next entity to process
> > > > > >      *
> > > > > >      * @sched: scheduler instance
> > > > > > + * @dequeue: dequeue selected entity
> > > > > >      *
> > > > > >      * Returns the entity to process or NULL if none are found.
> > > > > >      */
> > > > > >     static struct drm_sched_entity *
> > > > > > -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> > > > > > +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
> > > > > >     {
> > > > > >     	struct drm_sched_entity *entity;
> > > > > >     	int i;
> > > > > > @@ -935,8 +985,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> > > > > >     	/* Kernel run queue has higher priority than normal run queue*/
> > > > > >     	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> > > > > >     		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
> > > > > > -			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
> > > > > > -			drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
> > > > > > +			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i],
> > > > > > +							dequeue) :
> > > > > > +			drm_sched_rq_select_entity_rr(&sched->sched_rq[i],
> > > > > > +						      dequeue);
> > > > > >     		if (entity)
> > > > > >     			break;
> > > > > >     	}
> > > > > > @@ -1024,30 +1076,44 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
> > > > > >     EXPORT_SYMBOL(drm_sched_pick_best);
> > > > > >     /**
> > > > > > - * drm_sched_main - main scheduler thread
> > > > > > + * drm_sched_free_job_work - worker to call free_job
> > > > > >      *
> > > > > > - * @param: scheduler instance
> > > > > > + * @w: free job work
> > > > > >      */
> > > > > > -static void drm_sched_main(struct work_struct *w)
> > > > > > +static void drm_sched_free_job_work(struct work_struct *w)
> > > > > >     {
> > > > > >     	struct drm_gpu_scheduler *sched =
> > > > > > -		container_of(w, struct drm_gpu_scheduler, work_submit);
> > > > > > -	struct drm_sched_entity *entity;
> > > > > > +		container_of(w, struct drm_gpu_scheduler, work_free_job);
> > > > > >     	struct drm_sched_job *cleanup_job;
> > > > > > -	int r;
> > > > > >     	if (READ_ONCE(sched->pause_submit))
> > > > > >     		return;
> > > > > >     	cleanup_job = drm_sched_get_cleanup_job(sched);
> > > > > > -	entity = drm_sched_select_entity(sched);
> > > > > > +	if (cleanup_job) {
> > > > > > +		sched->ops->free_job(cleanup_job);
> > > > > > +
> > > > > > +		drm_sched_free_job_queue_if_ready(sched);
> > > > > > +		drm_sched_run_job_queue_if_ready(sched);
> > > > > > +	}
> > > > > > +}
> > > > > > -	if (!entity && !cleanup_job)
> > > > > > -		return;	/* No more work */
> > > > > > +/**
> > > > > > + * drm_sched_run_job_work - worker to call run_job
> > > > > > + *
> > > > > > + * @w: run job work
> > > > > > + */
> > > > > > +static void drm_sched_run_job_work(struct work_struct *w)
> > > > > > +{
> > > > > > +	struct drm_gpu_scheduler *sched =
> > > > > > +		container_of(w, struct drm_gpu_scheduler, work_run_job);
> > > > > > +	struct drm_sched_entity *entity;
> > > > > > +	int r;
> > > > > > -	if (cleanup_job)
> > > > > > -		sched->ops->free_job(cleanup_job);
> > > > > > +	if (READ_ONCE(sched->pause_submit))
> > > > > > +		return;
> > > > > > +	entity = drm_sched_select_entity(sched, true);
> > > > > >     	if (entity) {
> > > > > >     		struct dma_fence *fence;
> > > > > >     		struct drm_sched_fence *s_fence;
> > > > > > @@ -1056,9 +1122,7 @@ static void drm_sched_main(struct work_struct *w)
> > > > > >     		sched_job = drm_sched_entity_pop_job(entity);
> > > > > >     		if (!sched_job) {
> > > > > >     			complete_all(&entity->entity_idle);
> > > > > > -			if (!cleanup_job)
> > > > > > -				return;	/* No more work */
> > > > > > -			goto again;
> > > > > > +			return;	/* No more work */
> > > > > >     		}
> > > > > >     		s_fence = sched_job->s_fence;
> > > > > > @@ -1088,10 +1152,8 @@ static void drm_sched_main(struct work_struct *w)
> > > > > >     		}
> > > > > >     		wake_up(&sched->job_scheduled);
> > > > > > +		drm_sched_run_job_queue_if_ready(sched);
> > > > > >     	}
> > > > > > -
> > > > > > -again:
> > > > > > -	drm_sched_submit_queue(sched);
> > > > > >     }
> > > > > >     /**
> > > > > > @@ -1150,7 +1212,8 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> > > > > >     	spin_lock_init(&sched->job_list_lock);
> > > > > >     	atomic_set(&sched->hw_rq_count, 0);
> > > > > >     	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> > > > > > -	INIT_WORK(&sched->work_submit, drm_sched_main);
> > > > > > +	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
> > > > > > +	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
> > > > > >     	atomic_set(&sched->_score, 0);
> > > > > >     	atomic64_set(&sched->job_id_count, 0);
> > > > > >     	sched->pause_submit = false;
> > > > > > @@ -1275,7 +1338,8 @@ EXPORT_SYMBOL(drm_sched_submit_ready);
> > > > > >     void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)
> > > > > >     {
> > > > > >     	WRITE_ONCE(sched->pause_submit, true);
> > > > > > -	cancel_work_sync(&sched->work_submit);
> > > > > > +	cancel_work_sync(&sched->work_run_job);
> > > > > > +	cancel_work_sync(&sched->work_free_job);
> > > > > >     }
> > > > > >     EXPORT_SYMBOL(drm_sched_submit_stop);
> > > > > > @@ -1287,6 +1351,7 @@ EXPORT_SYMBOL(drm_sched_submit_stop);
> > > > > >     void drm_sched_submit_start(struct drm_gpu_scheduler *sched)
> > > > > >     {
> > > > > >     	WRITE_ONCE(sched->pause_submit, false);
> > > > > > -	queue_work(sched->submit_wq, &sched->work_submit);
> > > > > > +	queue_work(sched->submit_wq, &sched->work_run_job);
> > > > > > +	queue_work(sched->submit_wq, &sched->work_free_job);
> > > > > >     }
> > > > > >     EXPORT_SYMBOL(drm_sched_submit_start);
> > > > > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > > > > index 04eec2d7635f..fbc083a92757 100644
> > > > > > --- a/include/drm/gpu_scheduler.h
> > > > > > +++ b/include/drm/gpu_scheduler.h
> > > > > > @@ -487,9 +487,10 @@ struct drm_sched_backend_ops {
> > > > > >      *                 finished.
> > > > > >      * @hw_rq_count: the number of jobs currently in the hardware queue.
> > > > > >      * @job_id_count: used to assign unique id to the each job.
> > > > > > - * @submit_wq: workqueue used to queue @work_submit
> > > > > > + * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
> > > > > >      * @timeout_wq: workqueue used to queue @work_tdr
> > > > > > - * @work_submit: schedules jobs and cleans up entities
> > > > > > + * @work_run_job: schedules jobs
> > > > > > + * @work_free_job: cleans up jobs
> > > > > >      * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
> > > > > >      *            timeout interval is over.
> > > > > >      * @pending_list: the list of jobs which are currently in the job queue.
> > > > > > @@ -518,7 +519,8 @@ struct drm_gpu_scheduler {
> > > > > >     	atomic64_t			job_id_count;
> > > > > >     	struct workqueue_struct		*submit_wq;
> > > > > >     	struct workqueue_struct		*timeout_wq;
> > > > > > -	struct work_struct		work_submit;
> > > > > > +	struct work_struct		work_run_job;
> > > > > > +	struct work_struct		work_free_job;
> > > > > >     	struct delayed_work		work_tdr;
> > > > > >     	struct list_head		pending_list;
> > > > > >     	spinlock_t			job_list_lock;
>