Re: [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

John Harrison <john.c.harrison@xxxxxxxxx> · Wed, 11 Jan 2023 10:52:54 -0800

On 1/11/2023 10:07, Matthew Brost wrote:
On Wed, Jan 11, 2023 at 09:17:01AM +0000, Tvrtko Ursulin wrote:
On 10/01/2023 19:01, Matthew Brost wrote:
On Tue, Jan 10, 2023 at 04:50:55PM +0000, Tvrtko Ursulin wrote:
On 10/01/2023 15:55, Matthew Brost wrote:
On Tue, Jan 10, 2023 at 12:19:35PM +0000, Tvrtko Ursulin wrote:
On 10/01/2023 11:28, Tvrtko Ursulin wrote:
On 09/01/2023 17:27, Jason Ekstrand wrote:

[snip]

        >>> AFAICT it proposes to have 1:1 between *userspace* created
       contexts (per
        >>> context _and_ engine) and drm_sched. I am not sure avoiding
       invasive changes
        >>> to the shared code is in the spirit of the overall idea and
instead
        >>> opportunity should be used to look at way to refactor/improve
       drm_sched.

Maybe?  I'm not convinced that what Xe is doing is an abuse at all
or really needs to drive a re-factor.  (More on that later.)
There's only one real issue which is that it fires off potentially a
lot of kthreads. Even that's not that bad given that kthreads are
pretty light and you're not likely to have more kthreads than
userspace threads which are much heavier.  Not ideal, but not the
end of the world either.  Definitely something we can/should
optimize but if we went through with Xe without this patch, it would
probably be mostly ok.

        >> Yes, it is 1:1 *userspace* engines and drm_sched.
        >>
        >> I'm not really prepared to make large changes to DRM scheduler
       at the
        >> moment for Xe as they are not really required nor does Boris
       seem they
        >> will be required for his work either. I am interested to see
       what Boris
        >> comes up with.
        >>
        >>> Even on the low level, the idea to replace drm_sched threads
       with workers
        >>> has a few problems.
        >>>
        >>> To start with, the pattern of:
        >>>
        >>>    while (not_stopped) {
        >>>     keep picking jobs
        >>>    }
        >>>
        >>> Feels fundamentally in disagreement with workers (while
       obviously fits
        >>> perfectly with the current kthread design).
        >>
        >> The while loop breaks and worker exists if no jobs are ready.

I'm not very familiar with workqueues. What are you saying would fit
better? One scheduling job per work item rather than one big work
item which handles all available jobs?
Yes and no, it indeed IMO does not fit to have a work item which is
potentially unbound in runtime. But it is a bit moot conceptual mismatch
because it is a worst case / theoretical, and I think due more
fundamental concerns.

If we have to go back to the low level side of things, I've picked this
random spot to consolidate what I have already mentioned and perhaps
expand.

To start with, let me pull out some thoughts from workqueue.rst:

"""
Generally, work items are not expected to hog a CPU and consume many
cycles. That means maintaining just enough concurrency to prevent work
processing from stalling should be optimal.
"""

For unbound queues:
"""
The responsibility of regulating concurrency level is on the users.
"""

Given the unbound queues will be spawned on demand to service all queued
work items (more interesting when mixing up with the system_unbound_wq),
in the proposed design the number of instantiated worker threads does
not correspond to the number of user threads (as you have elsewhere
stated), but pessimistically to the number of active user contexts. That
is the number which drives the maximum number of not-runnable jobs that
can become runnable at once, and hence spawn that many work items, and
in turn unbound worker threads.

Several problems there.

It is fundamentally pointless to have potentially that many more threads
than the number of CPU cores - it simply creates a scheduling storm.
To make matters worse, if I follow the code correctly, all these per user
context worker thread / work items end up contending on the same lock or
circular buffer, both are one instance per GPU:

guc_engine_run_job
    -> submit_engine
       a) wq_item_append
           -> wq_wait_for_space
             -> msleep
a) is dedicated per xe_engine
Hah true, what its for then? I thought throttling the LRCA ring is done via:

This is a per guc_id 'work queue' which is used for parallel submission
(e.g. multiple LRC tail values need to written atomically by the GuC).
Again in practice there should always be space.
Speaking of guc id, where does blocking when none are available happen in
the non parallel case?

We have 64k guc_ids on native, 1k guc_ids with 64k VFs. Either way we
think that is more than enough and can just reject xe_engine creation if
we run out of guc_ids. If this proves to false, we can fix this but the
guc_id stealing the i915 is rather complicated and hopefully not needed.

We will limit the number of guc_ids allowed per user pid to reasonible
number to prevent a DoS. Elevated pids (e.g. IGTs) will be able do to
whatever they want.
What about doorbells? As some point, we will have to start using those 
and they are a much more limited resource - 256 total and way less with VFs.

John.