Re: [PATCH drm-misc-next 1/3] drm/sched: implement dynamic job flow control

Luben Tuikov <luben.tuikov@xxxxxxx> · Wed, 27 Sep 2023 08:41:51 -0400

On 2023-09-27 08:15, Christian König wrote:
> Am 27.09.23 um 14:11 schrieb Danilo Krummrich:
>> On 9/27/23 13:54, Christian König wrote:
>>> Am 26.09.23 um 09:11 schrieb Boris Brezillon:
>>>> On Mon, 25 Sep 2023 19:55:21 +0200
>>>> Christian König <christian.koenig@xxxxxxx> wrote:
>>>>
>>>>> Am 25.09.23 um 14:55 schrieb Boris Brezillon:
>>>>>> +The imagination team, who's probably interested too.
>>>>>>
>>>>>> On Mon, 25 Sep 2023 00:43:06 +0200
>>>>>> Danilo Krummrich <dakr@xxxxxxxxxx> wrote:
>>>>>>> Currently, job flow control is implemented simply by limiting the 
>>>>>>> amount
>>>>>>> of jobs in flight. Therefore, a scheduler is initialized with a
>>>>>>> submission limit that corresponds to a certain amount of jobs.
>>>>>>>
>>>>>>> This implies that for each job drivers need to account for the 
>>>>>>> maximum
>>>>>>> job size possible in order to not overflow the ring buffer.
>>>>>>>
>>>>>>> However, there are drivers, such as Nouveau, where the job size 
>>>>>>> has a
>>>>>>> rather large range. For such drivers it can easily happen that job
>>>>>>> submissions not even filling the ring by 1% can block subsequent
>>>>>>> submissions, which, in the worst case, can lead to the ring run dry.
>>>>>>>
>>>>>>> In order to overcome this issue, allow for tracking the actual 
>>>>>>> job size
>>>>>>> instead of the amount job jobs. Therefore, add a field to track a 
>>>>>>> job's
>>>>>>> submission units, which represents the amount of units a job 
>>>>>>> contributes
>>>>>>> to the scheduler's submission limit.
>>>>>> As mentioned earlier, this might allow some simplifications in the
>>>>>> PowerVR driver where we do flow-control using a dma_fence returned
>>>>>> through ->prepare_job(). The only thing that'd be missing is a way to
>>>>>> dynamically query the size of a job (a new hook?), instead of 
>>>>>> having the
>>>>>> size fixed at creation time, because PVR jobs embed native fence 
>>>>>> waits,
>>>>>> and the number of native fences will decrease if some of these fences
>>>>>> are signalled before ->run_job() is called, thus reducing the job 
>>>>>> size.
>>>>> Exactly that is a little bit questionable since it allows for the 
>>>>> device
>>>>> to postpone jobs infinitely.
>>>>>
>>>>> It would be good if the scheduler is able to validate if it's ever 
>>>>> able
>>>>> to run the job when it is pushed into the entity.
>>>> Yes, we do that already. We check that the immutable part of the job
>>>> (everything that's not a native fence wait) fits in the ringbuf.
>>>
>>> Yeah, but thinking more about it there might be really bad side 
>>> effects. We shouldn't use a callback nor job credits because it might 
>>> badly influence fairness between entities.

I feel the entity and job selection algorithm should not be influenced by
how many credits are available, but by the respective scheduling policy,
RR, FIFO, single-entity (job-FIFO).

Credits only tell us, "Can the hardware take on the next selected
job _at this time_?"

As long as,
1. the next selected job doesn't exceed the maximum job credit the hardware
   is capable of processing, and,
2. pending jobs are being completed by the hardware and returned back to
   the kernel, and their credits are added to the available credits,
we can guarantee forward progress, i.e. that the next selected job
would, in a finite time, execute and complete.

>>> In other words when one entity submits always large jobs and another 
>>> one always small ones then the scheduler would prefer the one which 
>>> submits the smaller ones because they are easier to fit into the ring 
>>> buffer.

I think we should stave off credit availability from scheduling. DRM
scheduling is based on factors which are "above the hardware", e.g. how long
an entity has been held off, how long a job therein has been waiting, etc.,
while "job credits" merely tell us if the next selected job, whichever one
it is, can be taken by the hardware _at this time_.

So, credits tell us something about the hardware. Scheduling tells
us something about entities, jobs, wait times, etc. I feel the DRM
scheduler should stay away from mixing credits and scheduling, since
that wouldn't be portable across different hardware/drivers we support,
and would make the code unnecessarily complicated and prone to bugs,
timeouts, possibly deadlocks, etc.

>>
>> That's true. Admittedly I mostly had DRM_SCHED_POLICY_SINGLE_ENTITY 
>> in mind
>> where obviously we just have a single entity.
> 
> I also only stumbled over it after thinking Boris suggestions through. 
> That problem really wasn't obvious.
> 
>>
>>>
>>> What we can do is the follow:
>>> 1. The scheduler has some initial credits it can use to push jobs.
>>> 2. Each scheduler fence (and *not* the job) has a credits field of 
>>> how much it will use.
>>> 3. After letting a a job run the credits of it's fence are subtracted 
>>> from the available credits of the scheduler.
>>> 4. The scheduler can keep running jobs as long as it has a positive 
>>> credit count.
>>> 5. When the credit count becomes negative it goes to sleep until a 
>>> scheduler fence signals and the count becomes positive again.
>>
>> Wouldn't it be possible that we overflow the ring with that or at 
>> least end up in
>> a busy wait in run_job()? What if the remaining credit count is 
>> greater than 0 but
>> smaller than the number of credits the next picked job requires?

That's fine and this is _exactly_ the purpose of job credits--you want
to know if you can push the next job _now_ or if you should wait until
some pending jobs in hardware finish executing in order to free resources
in the hardware, so that you _can_ push the next picked job down to
the hardware.

> 
> The initial credits the scheduler gets should be only halve the ring size.
> 
> So as long as that is positive you should have enough space even for the 
> biggest jobs.
> 
> We should still have a warning if userspace tries to push something 
> bigger into an entity.

Right. Generally, if userspace tries to push a job which exceeds the capability
of the hardware, as measured by job credits which are set by the hardware,
then emit a warning. 
-- 
Regards,
Luben

> 
> Regards,
> Christian.
> 
>>
>>>
>>> This way jobs are handled equally, you can still push jobs up to at 
>>> least halve your ring buffer size and you should be able to handle 
>>> your PowerVR case by calculating the credits you actually used in 
>>> your run_job() callback.
>>>
>>> As far as I can see that approach should work, shouldn't it?
>>>
>>> Regards,
>>> Christian.
>>>
>>
>