Re: [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

On 11/01/2023 17:52, Matthew Brost wrote:
> On Wed, Jan 11, 2023 at 09:09:45AM +0000, Tvrtko Ursulin wrote:

[snip]

>> Anyway, since you are not buying any arguments on paper, perhaps you
>> are more open towards testing. If you would adapt gem_wsim for Xe you
>> would be able to spawn N simulated transcode sessions on any Gen11+
>> machine and try it out.
>>
>> For example:
>>
>> gem_wsim -w benchmarks/wsim/media_load_balance_fhd26u7.wsim -c 36 -r 600
>>
>> That will run 36 parallel transcoding sessions for 600 frames each. No
>> client setup needed whatsoever apart from compiling IGT.
>>
>> In the past that was quite a handy tool for identifying scheduling
>> issues, or for validating changes against. All workloads with the media
>> prefix have actually been hand crafted by looking at what real media
>> pipelines do with real data. A few years back at least.


> Porting this is non-trivial as this is 2.5k. Also, in Xe we are trending
> towards using UMD benchmarks to determine if there are performance
> problems, as in i915 we had tons of microbenchmarks / IGT benchmarks
> that we found meant absolutely nothing. Can't say if this benchmark
> falls into that category.

I explained what it does, so it was supposed to be obvious it is not a microbenchmark.

2.5k what, lines of code? Difficulty of adding Xe support does not scale with LOC but with how much it uses the kernel API. You'd essentially need to handle context/engine creation and different execbuf.
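
Roughly speaking, the driver-specific seam is small. A sketch of the kind of ops table gem_wsim would need (all names purely illustrative, none of this is existing IGT code):

        struct wsim_driver_ops {
                /* i915: GEM context create plus an engine map;
                 * Xe: VM create plus engine create on that VM. */
                uint32_t (*ctx_create)(int drm_fd, unsigned int engine_class);

                /* i915: DRM_IOCTL_I915_GEM_EXECBUFFER2;
                 * Xe: DRM_IOCTL_XE_EXEC with a batch GPU address. */
                int (*submit)(int drm_fd, uint32_t ctx, uint64_t batch_addr);
        };

Everything above that seam - workload parsing, timing, the load balancing decisions - stays driver agnostic.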

It's not trivial, no, but it would save you downloading gigabytes of test streams, building a bunch of tools and libraries, etc., so overall, in my experience, it *significantly* improves driver development turn-around time.
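
For reference, a .wsim workload is just a plain-text list of steps, one per line. If I remember the README correctly, the basic step format is ctx.engine.duration.dependency.wait, so for example:

        1.VCS1.3000.0.0
        1.RCS.500-1000.0.0
        1.VCS1.3000.-1.1

would be context 1 submitting ~3ms of VCS1 work, then 0.5-1ms on the render engine, then another VCS1 batch which depends on the previous step (-1) and waits for its completion. Durations are in microseconds.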

> We have VK and compute benchmarks running and haven't found any major
> issues yet. The media UMD hasn't been ported because of the VM bind
> dependency, so I can't say if there are any issues with the media UMD +
> Xe.

> What I can do is hack up xe_exec_threads to really hammer Xe - change
> it to 128x xe_engines + 8k execs per thread. Each exec is super simple,
> it just stores a dword. It creates a thread per hardware engine, so on
> TGL this is 5x threads.
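
For anyone following along, such a store-dword exec boils down to a handful of dwords (variable names illustrative):

        uint32_t batch[16], i = 0;

        batch[i++] = MI_STORE_DWORD_IMM_GEN4;  /* (0x20 << 23) | 2 */
        batch[i++] = lower_32_bits(sdi_addr);  /* target GPU VA, low bits */
        batch[i++] = upper_32_bits(sdi_addr);  /* target GPU VA, high bits */
        batch[i++] = sdi_value;                /* the dword being stored */
        batch[i++] = MI_BATCH_BUFFER_END;      /* (0xa << 23) */

So per-exec GPU cost is negligible and the test is effectively measuring submission overhead.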

> Results below:
> root@DUT025-TGLU:mbrost# xe_exec_threads --r threads-basic
> IGT-Version: 1.26-ge26de4b2 (x86_64) (Linux: 6.1.0-rc1-xe+ x86_64)
> Starting subtest: threads-basic
> Subtest threads-basic: SUCCESS (1.215s)
> root@DUT025-TGLU:mbrost# dumptrace | grep job | wc
>    40960  491520 7401728
> root@DUT025-TGLU:mbrost# dumptrace | grep engine | wc
>      645    7095   82457

> So with 640 xe_engines (5x of which are VM engines) it takes 1.215
> seconds of test time to run 40960 execs, i.e. roughly 33.7k execs per
> second. That seems to indicate we do not have a scheduling problem.

> This is an 8 core (or at least 8 thread) TGL:
>
> root@DUT025-TGLU:mbrost# cat /proc/cpuinfo
> ...
> processor       : 7
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 140
> model name      : 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
> stepping        : 1
> microcode       : 0x3a
> cpu MHz         : 2344.098
> cache size      : 12288 KB
> physical id     : 0
> siblings        : 8
> core id         : 3
> cpu cores       : 4
> ...

> Enough data to be convinced there is no issue with this design? I can
> also hack up Xe to use fewer GPU schedulers with kthreads, but again
> that isn't trivial and doesn't seem necessary based on these results.

Not yet. It's not only about how many somethings per second you can do; it is also about what effect it has on the rest of the system.

Anyway, I think you said in a different sub-thread that you will move away from system_wq, so we can close this one. With that plan at least I don't have to worry that my mouse will stutter and audio will glitch while Xe is churning away.
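
To illustrate, the difference I care about is essentially this (sketch only, the struct fields and names are invented for illustration):

        /* Today: shared with every other system_wq user on the box. */
        queue_work(system_wq, &sched->submit_work);

        /* Instead: a dedicated queue, ordered if the old single-kthread
         * semantics need preserving. */
        sched->wq = alloc_ordered_workqueue("xe-sched", 0);
        if (!sched->wq)
                return -ENOMEM;

        queue_work(sched->wq, &sched->submit_work);

With a dedicated workqueue Xe at least cannot crowd out unrelated system_wq work, and concurrency/ordering can be tuned per scheduler via the WQ_* flags.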

Regards,

Tvrtko


