On 8/16/23 16:38, Matthew Brost wrote:
On Wed, Aug 16, 2023 at 02:30:38PM +0200, Danilo Krummrich wrote:
On 8/16/23 16:05, Christian König wrote:
Am 16.08.23 um 13:30 schrieb Danilo Krummrich:
Hi Matt,
On 8/11/23 04:31, Matthew Brost wrote:
In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
seems a bit odd but let us explain the reasoning below.
1. In XE the submission order from multiple drm_sched_entity is not
guaranteed to match the completion order, even when targeting the same
hardware engine. This is because in XE we have a firmware scheduler, the
GuC, which is allowed to reorder, timeslice, and preempt submissions. If
a shared drm_gpu_scheduler is used across multiple drm_sched_entity, the
TDR falls apart as the TDR expects submission order == completion order.
Using a dedicated drm_gpu_scheduler per drm_sched_entity solves this
problem.
2. In XE submissions are done via programming a ring buffer (circular
buffer). A drm_gpu_scheduler provides a limit on the number of in-flight
jobs; if that limit is set to RING_SIZE / MAX_SIZE_PER_JOB, we get flow
control on the ring for free.
In XE, where does the limitation of MAX_SIZE_PER_JOB come from?
In Xe the job submission is a series of ring instructions done by the KMD.
The instructions are cache flushes, seqno writes, a jump to the user BB,
etc... The exact instructions for each job vary based on hw engine type,
platform, etc... We derive MAX_SIZE_PER_JOB from the largest set of
instructions needed to submit a job and have a define in the driver for
this. I believe it is currently set to 192 bytes (the actual define is
MAX_JOB_SIZE_BYTES). So a 16k ring lets Xe have 85 jobs inflight at
once.
Ok, that sounds different to how Nouveau works. Is the "largest set of
instructions to submit a job" really a given by how the hardware works
rather than an arbitrary limit?

In Nouveau, userspace can submit an arbitrary number of addresses of
indirect buffers containing the ring instructions. The ring on the
kernel side takes the addresses of the indirect buffers rather than the
instructions themselves. Hence, technically there isn't really a limit
on the number of IBs submitted by a job except for the ring size.
In Nouveau we currently do have such a limitation as well, but it is
derived from the RING_SIZE, hence RING_SIZE / MAX_SIZE_PER_JOB would
always be 1. However, I think most jobs won't actually utilize the
whole ring.
Well, that should probably rather be RING_SIZE / MAX_SIZE_PER_JOB =
hw_submission_limit (or even hw_submission_limit - 1 when the hw can't
distinguish a full from an empty ring buffer).

Yes, hw_submission_limit = RING_SIZE / MAX_SIZE_PER_JOB in Xe.
Not sure if I get you right, let me try to clarify what I was trying to say:
I wanted to say that in Nouveau MAX_SIZE_PER_JOB isn't really limited by
anything other than the RING_SIZE and hence we'd never allow more than 1
active job.
I'm confused how there isn't a limit on the size of the job in Nouveau?
Based on what you have said, a job could be larger than the ring then?
As explained above, theoretically it could. It's only limited by the
ring size.
However, it seems to be more efficient to base ring flow control on the
actual size of each incoming job rather than on the worst case, namely
the maximum size of a job.

If this doesn't work for Nouveau, feel free to flow control the ring
differently, but this works rather well (and is simple) for Xe.
Matt
Otherwise your scheduler might just overwrite the ring buffer by pushing
things too fast.
Christian.
Given that, it seems like it would be better to let the scheduler
keep track of empty ring "slots" instead, such that the scheduler
can decide whether a subsequent job will still fit on the ring and,
if not, re-evaluate once a previous job finished. Of course each
submitted job would be required to carry the number of slots it
requires on the ring.

What do you think of implementing this as an alternative flow control
mechanism? Implementation wise this could be a union with the
existing hw_submission_limit.
- Danilo
A problem with this design is that currently a drm_gpu_scheduler uses a
kthread for submission / job cleanup. This doesn't scale if a large
number of drm_gpu_schedulers are used. To work around the scaling issue,
use a worker rather than a kthread for submission / job cleanup.
v2:
- (Rob Clark) Fix msm build
- Pass in run work queue
v3:
- (Boris) don't have loop in worker
v4:
- (Tvrtko) break out submit ready, stop, start helpers into own patch
Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>