Re: [PATCH drm-misc-next 1/3] drm/sched: implement dynamic job flow control

Danilo Krummrich <dakr@xxxxxxxxxx> · Wed, 27 Sep 2023 13:45:37 +0200

On 9/27/23 09:25, Boris Brezillon wrote:
On Wed, 27 Sep 2023 02:13:59 +0200
Danilo Krummrich <dakr@xxxxxxxxxx> wrote:

On 9/26/23 22:43, Luben Tuikov wrote:
Hi,

On 2023-09-24 18:43, Danilo Krummrich wrote:
Currently, job flow control is implemented simply by limiting the amount
of jobs in flight. Therefore, a scheduler is initialized with a
submission limit that corresponds to a certain amount of jobs.

"certain"? How about this instead:
" ... that corresponds to the number of jobs which can be sent
    to the hardware."?

This implies that for each job drivers need to account for the maximum
                                  ^,
Please add a comma after "job".

job size possible in order to not overflow the ring buffer.

Well, different hardware designs would implement this differently.
Ideally, you only want pointers into the ring buffer, and then
the hardware consumes as much as it can. But this is a moot point
and it's always a good idea to have a "job size" hint from the client.
So this is a good patch.

Ideally, you want to say that the hardware needs to be able to
accommodate the number of jobs which can fit in the hardware
queue times the largest job. This is a waste of resources
however, and it is better to give a hint as to the size of a job,
by the client. If the hardware can peek and understand dependencies,
on top of knowing the "size of the job", it can be an extremely
efficient scheduler.

However, there are drivers, such as Nouveau, where the job size has a
rather large range. For such drivers it can easily happen that job
submissions not even filling the ring by 1% can block subsequent
submissions, which, in the worst case, can lead to the ring running dry.
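
To put rough numbers on the paragraph above, here is a minimal sketch in plain C. It is not taken from the patch: the struct, the helper names and the figures are all made up for illustration, and both sizes are assumed to be non-zero.

```c
#include <stdint.h>

/* Hypothetical model of a ring; none of these names come from the patch. */
struct ring_model {
	uint32_t ring_units;    /* total ring capacity, in abstract units */
	uint32_t max_job_units; /* largest job the driver may ever submit */
};

/*
 * Limit by job count: every in-flight job has to be budgeted as if it were
 * the largest possible job, so the limit is capacity / max job size.
 */
static uint32_t job_count_limit(const struct ring_model *r)
{
	return r->ring_units / r->max_job_units;
}

/*
 * Limit by actual size: how many jobs of job_units each fit into the ring
 * when the real size of every job is tracked.
 */
static uint32_t jobs_fitting_by_size(const struct ring_model *r,
				     uint32_t job_units)
{
	return r->ring_units / job_units;
}
```

With an assumed ring of 1024 units and a maximum job size of 256 units, the count-based limit is 4 jobs. Four jobs of 2 units each then exhaust that limit while occupying 8 of 1024 units, which is under 1% of the ring; tracking sizes instead would let 512 such jobs be in flight.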

In order to overcome this issue, allow for tracking the actual job size
instead of the amount job jobs. Therefore, add a field to track a job's

"the amount job jobs." --> "the number of jobs."

Yeah, I somehow manage to always get this wrong, which I guess you noticed
below already.

Those are all good points below - gonna address them.

Did you see Boris' response regarding a separate callback in order to fetch
the job's submission units dynamically? Since this is needed by PowerVR, I'd
like to include this in V2. What's your take on that?

My only concern with that would be that, if I got what Boris was saying
correctly, calling

WARN_ON(s_job->submission_units > sched->submission_limit);

from drm_sched_can_queue() wouldn't work anymore, since this could indeed happen
temporarily. I think this was also Christian's concern.

Actually, I think it's fine to account for the max job size in the
first check, we're unlikely to have so many native fence waits that our
job can't fit in an empty ring buffer.

But it can happen, right? Hence, we can't have this check, can we?
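
For reference, a rough user-space sketch of the question being discussed. It is only a model: struct sched_model, struct job_model, the get_units callback and the "base size plus one unit per native fence wait" formula are assumptions made for illustration, not the scheduler's actual structures or the callback proposed for V2. Here a job whose current size exceeds the limit is simply reported as not queueable instead of triggering a warning, so it can be retried once its size has shrunk.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; not the real scheduler or job structures. */
struct sched_model {
	uint32_t submission_limit; /* ring capacity, in units */
	uint32_t units_in_flight;  /* units held by unfinished jobs; assumed
				    * to never exceed submission_limit */
};

struct job_model {
	/* Assumed driver callback returning the job's current size. */
	uint32_t (*get_units)(const struct job_model *job);
	uint32_t base_units;         /* size of the job itself */
	uint32_t native_fence_waits; /* waits still to be emitted */
};

/* Example callback: one extra unit per outstanding native fence wait. */
static uint32_t units_base_plus_waits(const struct job_model *job)
{
	return job->base_units + job->native_fence_waits;
}

/*
 * Returns true if the job fits into the free part of the ring right now.
 * An oversized job is not treated as a bug here, since its size may drop
 * as the native fences it waits on get signaled.
 */
static bool can_queue_model(const struct sched_model *s,
			    const struct job_model *job)
{
	uint32_t units = job->get_units(job);

	if (units > s->submission_limit)
		return false;

	return units <= s->submission_limit - s->units_in_flight;
}
```

In this model, a job with 60 base units and 8 pending waits does not fit a 64-unit ring even when the ring is empty, but it does fit once only 2 waits remain.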