Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread

Danilo Krummrich <dakr@xxxxxxxxxx> · Wed, 16 Aug 2023 18:33:55 +0200

On 8/16/23 16:59, Christian König wrote:
Am 16.08.23 um 14:30 schrieb Danilo Krummrich:
On 8/16/23 16:05, Christian König wrote:
Am 16.08.23 um 13:30 schrieb Danilo Krummrich:
Hi Matt,

On 8/11/23 04:31, Matthew Brost wrote:
In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
mapping between a drm_gpu_scheduler and drm_sched_entity. At first 
this
seems a bit odd but let us explain the reasoning below.

1. In XE the submission order from multiple drm_sched_entity is not
guaranteed to be the same completion even if targeting the same 
hardware
engine. This is because in XE we have a firmware scheduler, the GuC,
which allowed to reorder, timeslice, and preempt submissions. If a 
using
shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR 
falls
apart as the TDR expects submission order == completion order. Using a
dedicated drm_gpu_scheduler per drm_sched_entity solve this problem.

2. In XE submissions are done via programming a ring buffer (circular
buffer), a drm_gpu_scheduler provides a limit on number of jobs, if 
the
limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we get 
flow
control on the ring for free.

In XE, where does the limitation of MAX_SIZE_PER_JOB come from?

In Nouveau we currently do have such a limitation as well, but it is 
derived from the RING_SIZE, hence RING_SIZE / MAX_SIZE_PER_JOB would 
always be 1. However, I think most jobs won't actually utilize the 
whole ring.

Well that should probably rather be RING_SIZE / MAX_SIZE_PER_JOB = 
hw_submission_limit (or even hw_submission_limit - 1 when the hw 
can't distinct full vs empty ring buffer).

Not sure if I get you right, let me try to clarify what I was trying 
to say: I wanted to say that in Nouveau MAX_SIZE_PER_JOB isn't really 
limited by anything other than the RING_SIZE and hence we'd never 
allow more than 1 active job.

But that lets the hw run dry between submissions. That is usually a 
pretty horrible idea for performance.

Correct, that's the reason why I said it seems to be more efficient to 
base ring flow control on the actual size of each incoming job rather 
than the maximum size of a job.

However, it seems to be more efficient to base ring flow control on 
the actual size of each incoming job rather than the worst case, 
namely the maximum size of a job.

That doesn't sounds like a good idea to me. See we don't limit the 
number of submitted jobs based on the ring size, but rather we calculate 
the ring size based on the number of submitted jobs.

My point isn't really about whether we derive the ring size from the job 
limit or the other way around. It's more about the job size (or its 
maximum size) being arbitrary.

As mentioned in my reply to Matt:

"In Nouveau, userspace can submit an arbitrary amount of addresses of 
indirect bufferes containing the ring instructions. The ring on the 
kernel side takes the addresses of the indirect buffers rather than the 
instructions themself. Hence, technically there isn't really a limit on 
the amount of IBs submitted by a job except for the ring size."

So, my point is that I don't really want to limit the job size 
artificially just to be able to fit multiple jobs into the ring even if 
they're submitted at their "artificial" maximum size, but rather track 
how much of the ring the submitted job actually occupies.

In other words the hw_submission_limit defines the ring size, not the 
other way around. And you usually want the hw_submission_limit as low as 
possible for good scheduler granularity and to avoid extra overhead.

I don't think you really mean "as low as possible", do you? Because one 
really is the minimum if you want to do work at all, but as you 
mentioned above a job limit of one can let the ring run dry.

In the end my proposal comes down to tracking the actual size of a job 
rather than just assuming a pre-defined maximum job size, and hence a 
dynamic job limit.

I don't think this would hurt the scheduler granularity. In fact, it 
should even contribute to the desire of not letting the ring run dry 
even better. Especially for sequences of small jobs, where the current 
implementation might wrongly assume the ring is already full although 
actually there would still be enough space left.

Christian.

Otherwise your scheduler might just overwrite the ring buffer by 
pushing things to fast.

Christian.

Given that, it seems like it would be better to let the scheduler 
keep track of empty ring "slots" instead, such that the scheduler 
can deceide whether a subsequent job will still fit on the ring and 
if not re-evaluate once a previous job finished. Of course each 
submitted job would be required to carry the number of slots it 
requires on the ring.

What to you think of implementing this as alternative flow control 
mechanism? Implementation wise this could be a union with the 
existing hw_submission_limit.

- Danilo

A problem with this design is currently a drm_gpu_scheduler uses a
kthread for submission / job cleanup. This doesn't scale if a large
number of drm_gpu_scheduler are used. To work around the scaling 
issue,
use a worker rather than kthread for submission / job cleanup.

v2:
   - (Rob Clark) Fix msm build
   - Pass in run work queue
v3:
   - (Boris) don't have loop in worker
v4:
   - (Tvrtko) break out submit ready, stop, start helpers into own 
patch

Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>