On Mon, Aug 07, 2023 at 05:46:16PM +0200, Christian König wrote:
> Am 04.08.23 um 16:13 schrieb Matthew Brost:
> > [SNIP]
> > Christian / Daniel - I've read both of your comments and am having a
> > hard time parsing them. I do not really understand the issue with this
> > patch or exactly what is being suggested instead. Let's try to work
> > through this.
> >
> > > > > > I'm still extremely frowning on this.
> > > > > >
> > > > > > If you need this functionality then let the drivers decide which
> > > > > > runqueue the scheduler should use.
> >
> > What do you mean by runqueue here? Do you mean 'struct
> > workqueue_struct'? The scheduler in this context is 'struct
> > drm_gpu_scheduler', right?
>
> Sorry for the confusing wording, your understanding is correct.
>
> > Yes, we have added this functionality in the first patch.
> >
> > > > > > When you then create a single threaded runqueue you can just
> > > > > > submit work to it and serialize this with the scheduler work.
> >
> > We don't want to use a single threaded workqueue_struct in Xe; we want
> > to use a system_wq, as run_job() can be executed in parallel across
> > multiple entities (or drm_gpu_schedulers - in Xe we have a 1 to 1
> > relationship between entity and scheduler). What we want is, at per
> > entity / scheduler granularity, to be able to communicate a message
> > into the backend synchronously (run_job / free_job not executing,
> > scheduler execution not paused for a reset).
> >
> > If I'm understanding what you are suggesting, in Xe we'd create an
> > ordered workqueue_struct per drm_gpu_scheduler and then queue messages
> > on the ordered workqueue_struct?
>
> Yes, correct.
>
> > This seems pretty messy to me, as now we have open coded a solution
> > bypassing the scheduler, every drm_gpu_scheduler creates its own
> > workqueue_struct, and we'd also have to open code the pausing of these
> > messages for resets too.
> >
> > IMO this is a pretty clean solution that follows the pattern of the
> > cleanup jobs already in place.
>
> Yeah, exactly that's the point. Moving the job cleanup into the
> scheduler thread is seen as a very, very bad idea by me.
>
> And I really don't want to exercise that again for different use cases.
>
> > > > > > This way we wouldn't duplicate this core kernel function inside
> > > > > > the scheduler.
> > > > >
> > > > > Yeah that's essentially the design we picked for the tdr workers,
> > > > > where some drivers have requirements that all tdr work must be
> > > > > done on the same thread (because of cross-engine coordination
> > > > > issues). But that would require that we rework the scheduler as a
> > > > > pile of self-submitting work items, and I'm not sure that actually
> > > > > fits all that well into the core workqueue interfaces either.
> >
> > That is about the ordering between TDRs firing on different
> > drm_gpu_schedulers and larger external resets (a GT reset in Xe); an
> > ordered workqueue_struct makes sense for that. Here we are talking
> > about ordering jobs and messages within a single drm_gpu_scheduler.
> > Using the main execution thread to do the ordering makes sense in my
> > opinion.
>
> I completely disagree with that.
>
> Take a look at how this came to be. This is a very, very ugly hack, and
> we already had a hard time making lockdep understand the different fence
> signaling dependencies involved in freeing the job; I'm pretty sure that
> is still not 100% correct.
>
> > > > There were already patches floating around which did exactly that.
> > > >
> > > > Last time I checked those were actually looking pretty good.
> >
> > Link to the patches, for reference?
> >
> > > > In addition to the message passing advantage, the really big issue
> > > > with the scheduler and the 1 to 1 mapping is that we create a
> > > > kernel thread for each instance, which results in tons of overhead.
> >
> > The first patch in the series switches from a kthread to a work queue;
> > that is still a good idea.
>
> This was the patch I was referring to. Sorry, I didn't remember that
> this was in the same patch set.
>
> > > > Just using a work item which is submitted to a work queue
> > > > completely avoids that.
> > >
> > > Hm, I should have read the entire series first, since that still does
> > > the conversion. Apologies for the confusion, and yeah, we should be
> > > able to just submit other work to the same wq with the first patch?
> > > And so hand-rolling this infra here isn't needed at all?
> >
> > I wouldn't call this hand rolling; rather, it follows the pattern
> > already in place.
>
> Basically workqueues are the in-kernel infrastructure for exactly that
> use case, and we are trying to re-create that here, which is usually a
> rather bad idea.
>

Ok, let me play around with what this would look like in Xe. What you are
suggesting would be an ordered wq per scheduler, with a work item for
running a job, a work item for cleaning up a job, and a work item for a
message. That might work, I suppose? The only issue I see is scaling, as
this exposes ordered-wq creation directly to an IOCTL. No idea if that is
actually a concern though.
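
To check my understanding, the message side of that would be roughly the
below. This is just a sketch - the xe_* names and the msg layout are made
up, only the workqueue calls are real API:

struct xe_sched {
	struct drm_gpu_scheduler base;
	struct workqueue_struct *wq;	/* from alloc_ordered_workqueue() */
};

struct xe_sched_msg {
	struct work_struct work;
	unsigned int opcode;		/* backend defined */
	void *private_data;		/* backend defined */
};

static void xe_sched_process_msg(struct xe_sched_msg *msg); /* hypothetical
								backend handler */

static void xe_sched_msg_work(struct work_struct *w)
{
	struct xe_sched_msg *msg = container_of(w, struct xe_sched_msg, work);

	/*
	 * sched->wq is ordered, so this never runs concurrently with the
	 * run job / cleanup job work items queued on the same wq.
	 */
	xe_sched_process_msg(msg);
}

static void xe_sched_add_msg(struct xe_sched *sched,
			     struct xe_sched_msg *msg)
{
	INIT_WORK(&msg->work, xe_sched_msg_work);
	queue_work(sched->wq, &msg->work);
}

with sched->wq coming from alloc_ordered_workqueue() at scheduler creation
(i.e. at the exec queue IOCTL - that is the scaling part I'm unsure about),
and the run job / cleanup job paths becoming work items queued on the same
wq. The pausing for resets would still have to be open coded on top of
this, as AFAICS there is no generic pause / resume for a wq.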
Matt

> Regards,
> Christian.
>
> >
> > Matt
> >
> > > Or what am I missing?
> > >
> > > > Regards,
> > > > Christian.
> > > >
> > > > > Worst case I think this isn't a dead-end and can be refactored to
> > > > > internally use the workqueue services, with the new functions here
> > > > > just being dumb wrappers until everyone is converted over. So it
> > > > > doesn't look like an expensive mistake, if it turns out to be a
> > > > > mistake.
> > > > > -Daniel
> > > > > >
> > > > > > Regards,
> > > > > > Christian.
> > > > > > >
> > > > > > > Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> > > > > > > ---
> > > > > > >  drivers/gpu/drm/scheduler/sched_main.c | 52 +++++++++++++++++++++++++-
> > > > > > >  include/drm/gpu_scheduler.h            | 29 +++++++++++++-
> > > > > > >  2 files changed, 78 insertions(+), 3 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > index 2597fb298733..84821a124ca2 100644
> > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > @@ -1049,6 +1049,49 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
> > > > > > >  }
> > > > > > >  EXPORT_SYMBOL(drm_sched_pick_best);
> > > > > > >  
> > > > > > > +/**
> > > > > > > + * drm_sched_add_msg - add scheduler message
> > > > > > > + *
> > > > > > > + * @sched: scheduler instance
> > > > > > > + * @msg: message to be added
> > > > > > > + *
> > > > > > > + * Can and will pass any jobs waiting on dependencies or in a runnable queue.
> > > > > > > + * Message processing will stop if the scheduler run wq is stopped and resume
> > > > > > > + * when the run wq is started.
> > > > > > > + */
> > > > > > > +void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
> > > > > > > +		       struct drm_sched_msg *msg)
> > > > > > > +{
> > > > > > > +	spin_lock(&sched->job_list_lock);
> > > > > > > +	list_add_tail(&msg->link, &sched->msgs);
> > > > > > > +	spin_unlock(&sched->job_list_lock);
> > > > > > > +
> > > > > > > +	drm_sched_run_wq_queue(sched);
> > > > > > > +}
> > > > > > > +EXPORT_SYMBOL(drm_sched_add_msg);
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * drm_sched_get_msg - get scheduler message
> > > > > > > + *
> > > > > > > + * @sched: scheduler instance
> > > > > > > + *
> > > > > > > + * Returns NULL or message
> > > > > > > + */
> > > > > > > +static struct drm_sched_msg *
> > > > > > > +drm_sched_get_msg(struct drm_gpu_scheduler *sched)
> > > > > > > +{
> > > > > > > +	struct drm_sched_msg *msg;
> > > > > > > +
> > > > > > > +	spin_lock(&sched->job_list_lock);
> > > > > > > +	msg = list_first_entry_or_null(&sched->msgs,
> > > > > > > +				       struct drm_sched_msg, link);
> > > > > > > +	if (msg)
> > > > > > > +		list_del(&msg->link);
> > > > > > > +	spin_unlock(&sched->job_list_lock);
> > > > > > > +
> > > > > > > +	return msg;
> > > > > > > +}
> > > > > > > +
> > > > > > >  /**
> > > > > > >   * drm_sched_main - main scheduler thread
> > > > > > >   *
> > > > > > > @@ -1060,6 +1103,7 @@ static void drm_sched_main(struct work_struct *w)
> > > > > > >  		container_of(w, struct drm_gpu_scheduler, work_run);
> > > > > > >  	struct drm_sched_entity *entity;
> > > > > > >  	struct drm_sched_job *cleanup_job;
> > > > > > > +	struct drm_sched_msg *msg;
> > > > > > >  	int r;
> > > > > > >  
> > > > > > >  	if (READ_ONCE(sched->pause_run_wq))
> > > > > > > @@ -1067,12 +1111,15 @@ static void drm_sched_main(struct work_struct *w)
> > > > > > >  
> > > > > > >  	cleanup_job = drm_sched_get_cleanup_job(sched);
> > > > > > >  	entity = drm_sched_select_entity(sched);
> > > > > > > +	msg = drm_sched_get_msg(sched);
> > > > > > >  
> > > > > > > -	if (!entity && !cleanup_job)
> > > > > > > +	if (!entity && !cleanup_job && !msg)
> > > > > > >  		return;	/* No more work */
> > > > > > >  
> > > > > > >  	if (cleanup_job)
> > > > > > >  		sched->ops->free_job(cleanup_job);
> > > > > > > +	if (msg)
> > > > > > > +		sched->ops->process_msg(msg);
> > > > > > >  
> > > > > > >  	if (entity) {
> > > > > > >  		struct dma_fence *fence;
> > > > > > > @@ -1082,7 +1129,7 @@ static void drm_sched_main(struct work_struct *w)
> > > > > > >  		sched_job = drm_sched_entity_pop_job(entity);
> > > > > > >  		if (!sched_job) {
> > > > > > >  			complete_all(&entity->entity_idle);
> > > > > > > -			if (!cleanup_job)
> > > > > > > +			if (!cleanup_job && !msg)
> > > > > > >  				return;	/* No more work */
> > > > > > >  			goto again;
> > > > > > >  		}
> > > > > > > @@ -1177,6 +1224,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> > > > > > >  
> > > > > > >  	init_waitqueue_head(&sched->job_scheduled);
> > > > > > >  	INIT_LIST_HEAD(&sched->pending_list);
> > > > > > > +	INIT_LIST_HEAD(&sched->msgs);
> > > > > > >  	spin_lock_init(&sched->job_list_lock);
> > > > > > >  	atomic_set(&sched->hw_rq_count, 0);
> > > > > > >  	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> > > > > > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > > > > > index df1993dd44ae..267bd060d178 100644
> > > > > > > --- a/include/drm/gpu_scheduler.h
> > > > > > > +++ b/include/drm/gpu_scheduler.h
> > > > > > > @@ -394,6 +394,23 @@ enum drm_gpu_sched_stat {
> > > > > > >  	DRM_GPU_SCHED_STAT_ENODEV,
> > > > > > >  };
> > > > > > >  
> > > > > > > +/**
> > > > > > > + * struct drm_sched_msg - an in-band (relative to GPU scheduler run queue)
> > > > > > > + * message
> > > > > > > + *
> > > > > > > + * Generic enough for backend defined messages, backend can expand if needed.
> > > > > > > + */
> > > > > > > +struct drm_sched_msg {
> > > > > > > +	/** @link: list link into the gpu scheduler list of messages */
> > > > > > > +	struct list_head		link;
> > > > > > > +	/**
> > > > > > > +	 * @private_data: opaque pointer to message private data (backend defined)
> > > > > > > +	 */
> > > > > > > +	void				*private_data;
> > > > > > > +	/** @opcode: opcode of message (backend defined) */
> > > > > > > +	unsigned int			opcode;
> > > > > > > +};
> > > > > > > +
> > > > > > >  /**
> > > > > > >   * struct drm_sched_backend_ops - Define the backend operations
> > > > > > >   *	called by the scheduler
> > > > > > > @@ -471,6 +488,12 @@ struct drm_sched_backend_ops {
> > > > > > >  	 * and it's time to clean it up.
> > > > > > >  	 */
> > > > > > >  	void (*free_job)(struct drm_sched_job *sched_job);
> > > > > > > +
> > > > > > > +	/**
> > > > > > > +	 * @process_msg: Process a message. Allowed to block, it is this
> > > > > > > +	 * function's responsibility to free message if dynamically allocated.
> > > > > > > +	 */
> > > > > > > +	void (*process_msg)(struct drm_sched_msg *msg);
> > > > > > >  };
> > > > > > >  
> > > > > > >  /**
> > > > > > > @@ -482,6 +505,7 @@ struct drm_sched_backend_ops {
> > > > > > >   * @timeout: the time after which a job is removed from the scheduler.
> > > > > > >   * @name: name of the ring for which this scheduler is being used.
> > > > > > >   * @sched_rq: priority wise array of run queues.
> > > > > > > + * @msgs: list of messages to be processed in @work_run
> > > > > > >   * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler
> > > > > > >   *                 waits on this wait queue until all the scheduled jobs are
> > > > > > >   *                 finished.
> > > > > > > @@ -489,7 +513,7 @@ struct drm_sched_backend_ops {
> > > > > > >   * @job_id_count: used to assign unique id to the each job.
> > > > > > >   * @run_wq: workqueue used to queue @work_run
> > > > > > >   * @timeout_wq: workqueue used to queue @work_tdr
> > > > > > > - * @work_run: schedules jobs and cleans up entities
> > > > > > > + * @work_run: schedules jobs, cleans up jobs, and processes messages
> > > > > > >   * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
> > > > > > >   *            timeout interval is over.
> > > > > > >   * @pending_list: the list of jobs which are currently in the job queue.
> > > > > > > @@ -513,6 +537,7 @@ struct drm_gpu_scheduler {
> > > > > > >  	long				timeout;
> > > > > > >  	const char			*name;
> > > > > > >  	struct drm_sched_rq		sched_rq[DRM_SCHED_PRIORITY_COUNT];
> > > > > > > +	struct list_head		msgs;
> > > > > > >  	wait_queue_head_t		job_scheduled;
> > > > > > >  	atomic_t			hw_rq_count;
> > > > > > >  	atomic64_t			job_id_count;
> > > > > > > @@ -566,6 +591,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
> > > > > > >  
> > > > > > >  void drm_sched_job_cleanup(struct drm_sched_job *job);
> > > > > > >  void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
> > > > > > > +void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
> > > > > > > +		       struct drm_sched_msg *msg);
> > > > > > >  void drm_sched_run_wq_stop(struct drm_gpu_scheduler *sched);
> > > > > > >  void drm_sched_run_wq_start(struct drm_gpu_scheduler *sched);
> > > > > > >  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
>
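
P.S. To make the comparison concrete: with the patch as-is, the driver
side of a message reduces to a process_msg callback (hooked up via
drm_sched_backend_ops.process_msg) plus a call to drm_sched_add_msg. The
opcode, handler, and exec queue call below are hypothetical; only the
drm_sched_msg API is from the patch above:

enum xe_sched_msg_opcode {
	XE_SCHED_MSG_SUSPEND,	/* hypothetical backend opcode */
};

static void xe_sched_process_msg(struct drm_sched_msg *msg)
{
	switch (msg->opcode) {
	case XE_SCHED_MSG_SUSPEND:
		/*
		 * Runs in the scheduler work item, so it is already
		 * serialized against run_job / free_job and paused
		 * together with the scheduler during a reset.
		 */
		xe_exec_queue_suspend(msg->private_data);	/* hypothetical */
		break;
	}

	kfree(msg);	/* process_msg frees a dynamically allocated msg */
}

static void xe_sched_suspend(struct drm_gpu_scheduler *sched, void *data)
{
	struct drm_sched_msg *msg = kzalloc(sizeof(*msg), GFP_KERNEL);

	if (!msg)
		return;

	msg->opcode = XE_SCHED_MSG_SUSPEND;
	msg->private_data = data;
	drm_sched_add_msg(sched, msg);
}

No per-scheduler wq to create, and the pausing for resets falls out of the
existing run wq stop / start.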