On 10/01/2023 11:28, Tvrtko Ursulin wrote:
On 09/01/2023 17:27, Jason Ekstrand wrote:
[snip]
>>> AFAICT it proposes to have 1:1 between *userspace* created
>>> contexts (per context _and_ engine) and drm_sched. I am not sure
>>> avoiding invasive changes to the shared code is in the spirit of
>>> the overall idea and instead opportunity should be used to look at
>>> way to refactor/improve drm_sched.
> Maybe? I'm not convinced that what Xe is doing is an abuse at all or
> really needs to drive a re-factor. (More on that later.) There's
> only one real issue which is that it fires off potentially a lot of
> kthreads. Even that's not that bad given that kthreads are pretty
> light and you're not likely to have more kthreads than userspace
> threads which are much heavier. Not ideal, but not the end of the
> world either. Definitely something we can/should optimize but if we
> went through with Xe without this patch, it would probably be mostly ok.
>> Yes, it is 1:1 *userspace* engines and drm_sched.
>>
>> I'm not really prepared to make large changes to DRM scheduler at
>> the moment for Xe as they are not really required nor does Boris
>> seem they will be required for his work either. I am interested to
>> see what Boris comes up with.
>>> Even on the low level, the idea to replace drm_sched threads with
>>> workers has a few problems.
>>>
>>> To start with, the pattern of:
>>>
>>>   while (not_stopped) {
>>>       keep picking jobs
>>>   }
>>>
>>> Feels fundamentally in disagreement with workers (while obviously
>>> fits perfectly with the current kthread design).
>>
>> The while loop breaks and the worker exits if no jobs are ready.
> I'm not very familiar with workqueues. What are you saying would fit
> better? One scheduling job per work item rather than one big work
> item which handles all available jobs?
Yes and no. It indeed, IMO, does not fit to have a work item whose
runtime is potentially unbounded. But that conceptual mismatch is a
bit moot, since it is a worst-case / theoretical concern, and I think
the more fundamental issues lie elsewhere.
If we have to go back to the low level side of things, I've picked this
random spot to consolidate what I have already mentioned and perhaps
expand.
To start with, let me pull out some thoughts from workqueue.rst:
"""
Generally, work items are not expected to hog a CPU and consume many
cycles. That means maintaining just enough concurrency to prevent work
processing from stalling should be optimal.
"""
For unbound queues:
"""
The responsibility of regulating concurrency level is on the users.
"""
Given that unbound queues spawn worker threads on demand to service all
queued work items (more interesting when mixed with the
system_unbound_wq), in the proposed design the number of instantiated
worker threads does not correspond to the number of user threads (as
you have stated elsewhere), but pessimistically to the number of active
user contexts. That is the number which drives the maximum number of
not-runnable jobs that can become runnable at once, hence spawn that
many work items, and in turn that many unbound worker threads.
Several problems there.
It is fundamentally pointless to have potentially that many more threads
than the number of CPU cores - it simply creates a scheduling storm.
To make matters worse, if I follow the code correctly, all these per
user context worker threads / work items end up contending on the same
lock or circular buffer, both of which are one instance per GPU:
guc_engine_run_job
 -> submit_engine
    a) wq_item_append
        -> wq_wait_for_space
          -> msleep
    b) xe_guc_ct_send
        -> guc_ct_send
          -> mutex_lock(&ct->lock);
          -> later a potential msleep in h2g_has_room
Regards,
Tvrtko