Re: [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

Matthew Brost <matthew.brost@xxxxxxxxx> · Wed, 11 Jan 2023 17:52:35 +0000

On Wed, Jan 11, 2023 at 09:09:45AM +0000, Tvrtko Ursulin wrote:
> 
> On 11/01/2023 01:13, Matthew Brost wrote:
> > On Tue, Jan 10, 2023 at 04:39:00PM +0000, Matthew Brost wrote:
> > > On Tue, Jan 10, 2023 at 11:28:08AM +0000, Tvrtko Ursulin wrote:
> > > > 
> > > > 
> > > > On 09/01/2023 17:27, Jason Ekstrand wrote:
> > > > 
> > > > [snip]
> > > > 
> > > > >       >>> AFAICT it proposes to have 1:1 between *userspace* created
> > > > >      contexts (per
> > > > >       >>> context _and_ engine) and drm_sched. I am not sure avoiding
> > > > >      invasive changes
> > > > >       >>> to the shared code is in the spirit of the overall idea and instead
> > > > >       >>> opportunity should be used to look at way to refactor/improve
> > > > >      drm_sched.
> > > > > 
> > > > > 
> > > > > Maybe?  I'm not convinced that what Xe is doing is an abuse at all or
> > > > > really needs to drive a re-factor.  (More on that later.)  There's only
> > > > > one real issue which is that it fires off potentially a lot of kthreads.
> > > > > Even that's not that bad given that kthreads are pretty light and you're
> > > > > not likely to have more kthreads than userspace threads which are much
> > > > > heavier.  Not ideal, but not the end of the world either.  Definitely
> > > > > something we can/should optimize but if we went through with Xe without
> > > > > this patch, it would probably be mostly ok.
> > > > > 
> > > > >       >> Yes, it is 1:1 *userspace* engines and drm_sched.
> > > > >       >>
> > > > >       >> I'm not really prepared to make large changes to DRM scheduler
> > > > >      at the
> > > > >       >> moment for Xe as they are not really required nor does Boris
> > > > >      seem they
> > > > >       >> will be required for his work either. I am interested to see
> > > > >      what Boris
> > > > >       >> comes up with.
> > > > >       >>
> > > > >       >>> Even on the low level, the idea to replace drm_sched threads
> > > > >      with workers
> > > > >       >>> has a few problems.
> > > > >       >>>
> > > > >       >>> To start with, the pattern of:
> > > > >       >>>
> > > > >       >>>    while (not_stopped) {
> > > > >       >>>     keep picking jobs
> > > > >       >>>    }
> > > > >       >>>
> > > > >       >>> Feels fundamentally in disagreement with workers (while
> > > > >      obviously fits
> > > > >       >>> perfectly with the current kthread design).
> > > > >       >>
> > > > >       >> The while loop breaks and worker exists if no jobs are ready.
> > > > > 
> > > > > 
> > > > > I'm not very familiar with workqueues. What are you saying would fit
> > > > > better? One scheduling job per work item rather than one big work item
> > > > > which handles all available jobs?
> > > > 
> > > > Yes and no, it indeed IMO does not fit to have a work item which is
> > > > potentially unbound in runtime. But it is a bit moot conceptual mismatch
> > > > because it is a worst case / theoretical, and I think due more fundamental
> > > > concerns.
> > > > 
> > > > If we have to go back to the low level side of things, I've picked this
> > > > random spot to consolidate what I have already mentioned and perhaps expand.
> > > > 
> > > > To start with, let me pull out some thoughts from workqueue.rst:
> > > > 
> > > > """
> > > > Generally, work items are not expected to hog a CPU and consume many cycles.
> > > > That means maintaining just enough concurrency to prevent work processing
> > > > from stalling should be optimal.
> > > > """
> > > > 
> > > > For unbound queues:
> > > > """
> > > > The responsibility of regulating concurrency level is on the users.
> > > > """
> > > > 
> > > > Given the unbound queues will be spawned on demand to service all queued
> > > > work items (more interesting when mixing up with the system_unbound_wq), in
> > > > the proposed design the number of instantiated worker threads does not
> > > > correspond to the number of user threads (as you have elsewhere stated), but
> > > > pessimistically to the number of active user contexts. That is the number
> > > > which drives the maximum number of not-runnable jobs that can become
> > > > runnable at once, and hence spawn that many work items, and in turn unbound
> > > > worker threads.
> > > > 
> > > > Several problems there.
> > > > 
> > > > It is fundamentally pointless to have potentially that many more threads
> > > > than the number of CPU cores - it simply creates a scheduling storm.
> > > > 
> > > 
> > > We can use a different work queue if this is an issue, have a FIXME
> > > which indicates we should allow the user to pass in the work queue.
> > > 
> > > > Unbound workers have no CPU / cache locality either and no connection with
> > > > the CPU scheduler to optimize scheduling patterns. This may matter either on
> > > > large systems or on small ones. Whereas the current design allows for
> > > > scheduler to notice userspace CPU thread keeps waking up the same drm
> > > > scheduler kernel thread, and so it can keep them on the same CPU, the
> > > > unbound workers lose that ability and so 2nd CPU might be getting woken up
> > > > from low sleep for every submission.
> > > > 
> > > 
> > > I guess I don't understand kthread vs. workqueue scheduling internals.
> > 
> > Looked into this and we are not using unbound workers rather we are just
> > using the system_wq which is indeed bound. Again we can change this so a
> > user can just pass in worker too. After doing a of research bound
> > workers allows the scheduler to use locality too avoid that exact
> > problem your reading.
> > 
> > TL;DR I'm not buying any of these arguments although it is possible I am
> > missing something.
> 
> Well you told me it's using unbound.. message id
> Y7dEjcuc1arHBTGu@xxxxxxxxxxxxxxxxxxxxxxxx:
> 
> """
> Right now the system_unbound_wq is used which does have a limit on the
> number of threads, right? I do have a FIXME to allow a worker to be
> passed in similar to TDR.
> """
> 

Yea, my mistake. A quick look at the shows we are using system_wq (same
as TDR).

> With bound workers you will indeed get CPU locality. I am not sure what it
> will do in terms of concurrency. If it will serialize work items to fewer
> spawned workers that will be good for the CT contention issue, but may
> negatively affect latency. And possibly preemption / time slicing decisions
> since the order of submitting to the backend will not be in the order of
> context priority, hence high prio may be submitted right after low and
> immediately trigger preemption.
>

We should probably use system_highpri_wq for high priority contexts
(xe_engine).

> Anyway, since you are not buying any arguments on paper perhaps you are more
> open towards testing. If you would adapt gem_wsim for Xe you would be able
> to spawn N simulated transcode sessions on any Gen11+ machine and try it
> out.
> 
> For example:
> 
> gem_wsim -w benchmarks/wsim/media_load_balance_fhd26u7.wsim -c 36 -r 600
> 
> That will run you 36 parallel transcoding sessions streams for 600 frames
> each. No client setup needed whatsoever apart from compiling IGT.
> 
> In the past that was quite a handy tool to identify scheduling issues, or
> validate changes against. All workloads with the media prefix have actually
> been hand crafted by looking at what real media pipelines do with real data.
> Few years back at least.
> 

Porting this is non-trivial as this is 2.5k. Also in Xe we are trending
to use UMD benchmarks to determine if there are performance problems as
in the i915 we had tons microbenchmarks / IGT benchmarks that we found
meant absolutely nothing. Can't say if this benchmark falls into that
category.

We VK and compute benchmarks running and haven't found any major issues
yet. The media UMD hasn't been ported because of the VM bind dependency
so I can't say if there are any issues with the media UMD + Xe.

What I can do hack up xe_exec_threads to really hammer Xe - change it to
128x xe_engines + 8k execs per thread. Each exec is super simple, it
just stores a dword. It creates a thread per hardware engine, so on TGL
this is 5x threads.

Results below:
root@DUT025-TGLU:mbrost# xe_exec_threads --r threads-basic
IGT-Version: 1.26-ge26de4b2 (x86_64) (Linux: 6.1.0-rc1-xe+ x86_64)
Starting subtest: threads-basic
Subtest threads-basic: SUCCESS (1.215s)
root@DUT025-TGLU:mbrost# dumptrace | grep job | wc
  40960  491520 7401728
root@DUT025-TGLU:mbrost# dumptrace | grep engine | wc
    645    7095   82457

So with 640 xe_engines (5x are VM engines) it takes 1.215 seconds test
time to run 40960 execs. That seems to indicate we do not have a
scheduling problem.

This is 8 core (or at least 8 threads) TGL:

root@DUT025-TGLU:mbrost# cat /proc/cpuinfo
...
processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 140
model name      : 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
stepping        : 1
microcode       : 0x3a
cpu MHz         : 2344.098
cache size      : 12288 KB
physical id     : 0
siblings        : 8
core id         : 3
cpu cores       : 4
...

Enough data to be convinced there is not issue with this design? I can
also hack up Xe to use less GPU schedulers /w a kthreads but again that
isn't trivial and doesn't seem necessary based on these results.

> It could show you real world behaviour of the kworkers approach and it could
> also enable you to cross reference any power and performance changes
> relative to i915. Background story there is that media servers like to fit N
> streams to a server and if a change comes along which suddenly makes only
> N-1 stream fit before dropping out of realtime, that's a big problem.
> 
> If you will believe me there is value in that kind of testing I am happy to
> help you add Xe support to the tool, time permitting so possibly guidance
> only at the moment.

If we want to port the tool I wont stop you and provide support if you
struggle with the uAPI but based on my results above I don't think this
is necessary.

Matt

> 
> Regards,
> 
> Tvrtko