Re: [RFC v3 00/12] DRM scheduling cgroup controller

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Fri, 27 Jan 2023 11:40:58 +0000

On 27/01/2023 10:04, Michal Koutný wrote:
On Thu, Jan 26, 2023 at 05:57:24PM +0000, Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> wrote:
So even if the RFC shows just a simple i915 implementation, the controller
itself shouldn't prevent a smarter approach (via exposed ABI).

scan/query + over budget notification is IMO limited in guarantees.

It is yes, I tried to stress out that it is not a hard guarantee in any 
shape and form and that the "quality" of adhering to the allocated 
budget will depend on individual hw and sw capabilities.

But it is what I believe is the best approach given a) how different in 
scheduling capability GPU drivers are and b) the very fact there isn't a 
central scheduling entity as opposed to the CPU side of things.

It is just no possible to do a hard guarantee system. GPUs do not 
preempt as nicely and easily as the CPUs do. And the frequency of 
context switches varies widely from too fast to "never", so again, 
charging would have several problems to overcome which would make the 
whole setup IMHO pointless.

And not only that some GPUs do not preempt nicely, but some even can't 
do any of this, period. Even if we stay within the lineage of hardware 
supported by only i915, we have three distinct categories: 1) can't do 
any of this, 2a) can do fine grained priority based scheduling with 
reasonable preemption capability, 2b) ditto but without reasonable 
preemption capability, and 3) like 2a) and 2b) but with the scheduler in 
the firmware and currently supporting coarse priority based scheduling.

Shall I also mention that a single cgroup can contain multiple GPU 
clients, all using different GPUs with a different mix of the above 
listed challenges?

The main point is, should someone prove me wrong and come up a smarter 
way at some point in the future, then "drm.weight" as an ABI remains 
compatible and the improvement can happen completely under the hood. In 
the mean time users get external control, and _some_ ability to improve 
the user experience with the scenarios such as I described yesterday.

[...]
Yes agreed, and to re-stress out, the ABI as proposed does not preclude
changing from scanning to charging or whatever. The idea was for it to be
compatible in concept with the CPU controller and also avoid baking in the
controlling method to individual drivers.
[...]

But I submit to your point of rather not exposing this via cgroup API
for possible future refinements.

Ack.

Secondly, doing this in userspace would require the ability to get some sort
of an atomic snapshot of the whole tree hierarchy to account for changes in
layout of the tree and task migrations. Or some retry logic with some added
ABI fields to enable it.

Note, that the proposed implementation is succeptible to miscount due to
concurrent tree modifications and task migrations too (scanning may not
converge under frequent cgroup layout modifications, and migrating tasks
may be summed 0 or >1 times). While in-kernel implementation may assure
the snapshot view, it'd come at cost. (Read: since the mechanism isn't
precise anyway, I don't suggest a fully synchronized scanning.)

The part that scanning may not converge in my _current implementation_ 
is true. For instance if clients would be constantly coming and going, 
for that I took a shortcut of not bothering to accumulate usage on 
process/client exit, and I just wait for a stable two periods to look at 
the current state. I reckon this is possibly okay for the real world.

Cgroup tree hierarchy modifications being the reason for not converging 
can also happen, but I thought I can hand wave that as not a realistic 
scenario. Perhaps I am not imaginative enough?

Under or over-accounting for migrating tasks I don't think can happen 
since I am explicitly handling that.

Regards,

Tvrtko