Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

On Tue, Jan 10, 2023 at 04:50:55PM +0000, Tvrtko Ursulin wrote:
> 
> On 10/01/2023 15:55, Matthew Brost wrote:
> > On Tue, Jan 10, 2023 at 12:19:35PM +0000, Tvrtko Ursulin wrote:
> > > 
> > > On 10/01/2023 11:28, Tvrtko Ursulin wrote:
> > > > 
> > > > 
> > > > On 09/01/2023 17:27, Jason Ekstrand wrote:
> > > > 
> > > > [snip]
> > > > 
> > > > >       >>> AFAICT it proposes to have 1:1 between *userspace* created
> > > > >      contexts (per
> > > > >       >>> context _and_ engine) and drm_sched. I am not sure avoiding
> > > > >      invasive changes
> > > > >       >>> to the shared code is in the spirit of the overall idea and
> > > > > instead
> > > > >       >>> opportunity should be used to look at way to refactor/improve
> > > > >      drm_sched.
> > > > > 
> > > > > 
> > > > > Maybe?  I'm not convinced that what Xe is doing is an abuse at all
> > > > > or really needs to drive a re-factor.  (More on that later.)
> > > > > There's only one real issue which is that it fires off potentially a
> > > > > lot of kthreads. Even that's not that bad given that kthreads are
> > > > > pretty light and you're not likely to have more kthreads than
> > > > > userspace threads which are much heavier.  Not ideal, but not the
> > > > > end of the world either.  Definitely something we can/should
> > > > > optimize but if we went through with Xe without this patch, it would
> > > > > probably be mostly ok.
> > > > > 
> > > > >       >> Yes, it is 1:1 *userspace* engines and drm_sched.
> > > > >       >>
> > > > >       >> I'm not really prepared to make large changes to DRM scheduler
> > > > >      at the
> > > > >       >> moment for Xe as they are not really required nor does Boris
> > > > >      seem they
> > > > >       >> will be required for his work either. I am interested to see
> > > > >      what Boris
> > > > >       >> comes up with.
> > > > >       >>
> > > > >       >>> Even on the low level, the idea to replace drm_sched threads
> > > > >      with workers
> > > > >       >>> has a few problems.
> > > > >       >>>
> > > > >       >>> To start with, the pattern of:
> > > > >       >>>
> > > > >       >>>    while (not_stopped) {
> > > > >       >>>     keep picking jobs
> > > > >       >>>    }
> > > > >       >>>
> > > > >       >>> Feels fundamentally in disagreement with workers (while
> > > > >      obviously fits
> > > > >       >>> perfectly with the current kthread design).
> > > > >       >>
> > > > >       >> The while loop breaks and the worker exits if no jobs are ready.
> > > > > 
> > > > > 
> > > > > I'm not very familiar with workqueues. What are you saying would fit
> > > > > better? One scheduling job per work item rather than one big work
> > > > > item which handles all available jobs?
> > > > 
> > > > Yes and no. It indeed IMO does not fit to have a work item which is
> > > > potentially unbound in runtime. But that conceptual mismatch is a bit
> > > > moot because it is a worst-case / theoretical concern, and I think there
> > > > are more fundamental issues.
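
For illustration, the "one job per work item" idea would look roughly like
this (hand-wavy sketch with made-up names, not the actual drm_sched or Xe
code): each pass runs at most one job and then re-queues itself, so no
single work item hogs a CPU for long.

#include <linux/kernel.h>
#include <linux/workqueue.h>

struct my_job;					/* hypothetical */

struct my_sched {				/* hypothetical */
	struct workqueue_struct *wq;
	struct work_struct work;
};

struct my_job *my_sched_pick_job(struct my_sched *sched);		/* hypothetical */
void my_sched_run_job(struct my_sched *sched, struct my_job *job);	/* hypothetical */

static void my_sched_run_job_work(struct work_struct *w)
{
	struct my_sched *sched = container_of(w, struct my_sched, work);
	struct my_job *job = my_sched_pick_job(sched);

	if (!job)
		return;				/* nothing runnable, go idle */

	my_sched_run_job(sched, job);

	/* Hand the CPU back; the next job gets its own pass. */
	queue_work(sched->wq, &sched->work);
}
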
> > > > 
> > > > If we have to go back to the low level side of things, I've picked this
> > > > random spot to consolidate what I have already mentioned and perhaps
> > > > expand.
> > > > 
> > > > To start with, let me pull out some thoughts from workqueue.rst:
> > > > 
> > > > """
> > > > Generally, work items are not expected to hog a CPU and consume many
> > > > cycles. That means maintaining just enough concurrency to prevent work
> > > > processing from stalling should be optimal.
> > > > """
> > > > 
> > > > For unbound queues:
> > > > """
> > > > The responsibility of regulating concurrency level is on the users.
> > > > """
> > > > 
> > > > Given that unbound workers will be spawned on demand to service all
> > > > queued work items (more interesting when mixing with the
> > > > system_unbound_wq), in the proposed design the number of instantiated
> > > > worker threads does not correspond to the number of user threads (as you
> > > > have stated elsewhere), but pessimistically to the number of active user
> > > > contexts. That is the number which drives the maximum number of
> > > > not-runnable jobs that can become runnable at once, and hence spawn that
> > > > many work items, and in turn unbound worker threads.
> > > > 
> > > > Several problems there.
> > > > 
> > > > It is fundamentally pointless to have potentially that many more threads
> > > > than the number of CPU cores - it simply creates a scheduling storm.
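
For reference, concurrency on an unbound workqueue can be capped via
max_active when the queue is allocated. A minimal sketch (queue name made
up, not what this patch currently does):

#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *submit_wq;

static int submit_wq_init(void)
{
	/* At most 4 submission work items run concurrently, rather than
	 * potentially one worker per active user context. */
	submit_wq = alloc_workqueue("submit_wq", WQ_UNBOUND, 4);

	return submit_wq ? 0 : -ENOMEM;
}
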
> > > 
> > > To make matters worse, if I follow the code correctly, all these
> > > per-user-context worker threads / work items end up contending on the same
> > > lock or circular buffer, both of which are one instance per GPU:
> > > 
> > > guc_engine_run_job
> > >   -> submit_engine
> > >      a) wq_item_append
> > >          -> wq_wait_for_space
> > >            -> msleep
> > 
> > a) is dedicated per xe_engine
> 
> Hah, true, what is it for then? I thought throttling the LRCA ring is done via:
> 

This is a per-guc_id 'work queue' which is used for parallel submission
(e.g. multiple LRC tail values need to be written atomically by the GuC).
Again, in practice there should always be space.
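
Roughly, the wait-for-space path has this shape (hand-wavy sketch with
made-up names, not the actual Xe code): back off with msleep until the GuC
has drained enough of the work queue, with a timeout as a safety net.

#include <linux/delay.h>
#include <linux/errno.h>
#include <linux/ktime.h>
#include <linux/types.h>

struct submit_wq_sketch;				/* hypothetical */
u32 submit_wq_space(struct submit_wq_sketch *wq);	/* hypothetical */

static int wq_wait_for_space_sketch(struct submit_wq_sketch *wq, u32 size,
				    unsigned int timeout_ms)
{
	ktime_t end = ktime_add_ms(ktime_get(), timeout_ms);

	/* In practice there should always be space, so this rarely loops. */
	while (submit_wq_space(wq) < size) {
		if (ktime_after(ktime_get(), end))
			return -ETIMEDOUT;
		msleep(1);		/* give the GuC time to consume items */
	}

	return 0;
}
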

>   drm_sched_init(&ge->sched, &drm_sched_ops,
> 		 e->lrc[0].ring.size / MAX_JOB_SIZE_BYTES,
> 
> Is there something to throttle other than the ring? It is throttling
> something using msleeps...
> 
> > Also, you missed the step of programming the ring, which is dedicated per xe_engine.
> 
> I was trying to quickly find places which serialize on something in the
> backend, ringbuffer emission did not seem to do that but maybe I missed
> something.
>

The xe_ring_ops vfunc emit_job is called to write the ring.
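
Conceptually it is something like this (made-up names, not the actual
xe_ring_ops definitions): each engine writes commands into its own ring
through its emit_job hook, so ring emission needs no cross-engine lock.

struct engine_sketch;				/* hypothetical */
struct job_sketch;				/* hypothetical */

struct ring_ops_sketch {
	void (*emit_job)(struct engine_sketch *e, struct job_sketch *job);
};

static void emit_job_sketch(struct engine_sketch *e, struct job_sketch *job,
			    const struct ring_ops_sketch *ops)
{
	/* Touches only this engine's ring buffer. */
	ops->emit_job(e, job);
}
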
 
> > 
> > >      b) xe_guc_ct_send
> > >          -> guc_ct_send
> > >            -> mutex_lock(&ct->lock);
> > >            -> later a potential msleep in h2g_has_room
> > 
> > Technically there is one instance per GT, not per GPU. Yes, this is shared,
> > but in practice there will always be space in the CT channel, so contention
> > on the lock should be rare.
> 
> Yeah I used the term GPU to be more understandable to outside audience.
> 
> I am somewhat disappointed that the Xe opportunity hasn't been used to
> improve upon the CT communication bottlenecks. I mean those backoff sleeps
> and lock contention. I wish there would be a single thread in charge of the
> CT channel and internal users (other parts of the driver) would be able to
> send their requests to it in a more efficient manner, with less lock
> contention and centralized backoff.
>

Well, the CT backend was more or less a complete rewrite. Mutexes
actually work rather well to ensure fairness compared to the spin locks
used in the i915. This code was pretty heavily reviewed by Daniel, and
we both landed on a single big mutex for all of the CT code, compared to
the 3 or 4 spin locks used in the i915.
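
i.e. the send path is serialized on that one mutex, roughly like this (very
rough sketch with made-up names; the real guc_ct_send handles far more,
including credits and error paths):

#include <linux/delay.h>
#include <linux/mutex.h>
#include <linux/types.h>

struct ct_sketch {
	struct mutex lock;		/* one lock for the whole CT channel */
};

bool ct_has_room_sketch(struct ct_sketch *ct, u32 len);			/* hypothetical */
void ct_write_sketch(struct ct_sketch *ct, const u32 *msg, u32 len);	/* hypothetical */

static void ct_send_sketch(struct ct_sketch *ct, const u32 *msg, u32 len)
{
	mutex_lock(&ct->lock);		/* mutex gives fair-ish FIFO ordering */

	while (!ct_has_room_sketch(ct, len))
		msleep(1);		/* back off while the GuC drains H2G */

	ct_write_sketch(ct, msg, len);
	mutex_unlock(&ct->lock);
}
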
 
> > I haven't read your rather long reply yet, but also FWIW using a
> > workqueue was suggested by AMD (original authors of the DRM scheduler)
> > when we ran this design by them.
> 
> Commit message says nothing about that. ;)
>

Yeah, I missed that, will fix in the next rev. Just dug through my emails:
Christian suggested a work queue, and Andrey also gave some input on the
DRM scheduler design.

Also, in the next rev I will likely update the run_wq so it can be passed
in by the user.
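
Something like this shape (hypothetical sketch only, not the final
interface): the driver can hand the scheduler its own workqueue and thereby
control concurrency/ordering, otherwise fall back to a shared default.

#include <linux/workqueue.h>

struct sched_init_args_sketch {			/* hypothetical */
	struct workqueue_struct *run_wq;	/* NULL means "use a default" */
	unsigned int hw_submission_limit;
};

static struct workqueue_struct *
sched_pick_run_wq(const struct sched_init_args_sketch *args)
{
	if (args->run_wq)
		return args->run_wq;		/* driver-managed concurrency */

	return system_unbound_wq;		/* shared fallback */
}
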

Matt

> Regards,
> 
> Tvrtko


