On 27/01/2023 10:04, Michal Koutný wrote:
On Thu, Jan 26, 2023 at 05:57:24PM +0000, Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> wrote:
So even if the RFC shows just a simple i915 implementation, the controller
itself shouldn't prevent a smarter approach (via exposed ABI).
scan/query + over budget notification is IMO limited in guarantees.
It is, yes. I tried to stress that it is not a hard guarantee in any
shape or form and that the "quality" of adhering to the allocated
budget will depend on individual hw and sw capabilities.
But it is what I believe is the best approach given a) how different in
scheduling capability GPU drivers are and b) the very fact there isn't a
central scheduling entity as opposed to the CPU side of things.
It is just not possible to build a hard guarantee system. GPUs do not
preempt as nicely and easily as the CPUs do. And the frequency of
context switches varies widely from too fast to "never", so again,
charging would have several problems to overcome which would make the
whole setup IMHO pointless.
And it is not only that some GPUs do not preempt nicely; some can't
do any of this, period. Even if we stay within the lineage of hardware
supported by only i915, we have three distinct categories: 1) can't do
any of this, 2a) can do fine grained priority based scheduling with
reasonable preemption capability, 2b) ditto but without reasonable
preemption capability, and 3) like 2a) and 2b) but with the scheduler in
the firmware and currently supporting coarse priority based scheduling.
Shall I also mention that a single cgroup can contain multiple GPU
clients, all using different GPUs with a different mix of the above
listed challenges?
The main point is, should someone prove me wrong and come up with a smarter
way at some point in the future, then "drm.weight" as an ABI remains
compatible and the improvement can happen completely under the hood. In
the meantime users get external control, and _some_ ability to improve
the user experience in scenarios such as those I described yesterday.
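
To make the "weight in, budget out" idea concrete, here is a minimal
sketch of how a proportional split plus an over budget check could look.
It is only an illustration and not the RFC code: the function names, the
group names and the microsecond units are all made up for the example.

```python
# Hypothetical sketch, not the RFC implementation: names and units are assumed.

def split_budget(period_us, weights):
    """Split one period of GPU time among sibling groups in proportion to weight."""
    total = sum(weights.values())
    if total == 0:
        return {name: 0.0 for name in weights}
    return {name: period_us * w / total for name, w in weights.items()}


def over_budget(usage_us, budgets):
    """Return, sorted by name, the groups whose observed usage exceeded their share."""
    return sorted(
        name for name, used in usage_us.items() if used > budgets.get(name, 0.0)
    )


# Two sibling groups sharing a 2 second period with weights 300 and 100.
weights = {"compositor": 300, "batch": 100}
budgets = split_budget(2_000_000, weights)

# Usage as it might be observed by a periodic scan.
usage = {"compositor": 900_000, "batch": 1_100_000}
flagged = over_budget(usage, budgets)
```

Only the weights are visible to userspace here, so how usage is gathered
(scanning, charging or something else) could change without touching the
interface.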
[...]
Yes, agreed, and to stress it again, the ABI as proposed does not preclude
changing from scanning to charging or whatever. The idea was for it to be
compatible in concept with the CPU controller and also avoid baking in the
controlling method to individual drivers.
[...]
But I submit to your point of rather not exposing this via cgroup API
for possible future refinements.
Ack.
Secondly, doing this in userspace would require the ability to get some sort
of an atomic snapshot of the whole tree hierarchy to account for changes in
layout of the tree and task migrations. Or some retry logic with some added
ABI fields to enable it.
Note that the proposed implementation is susceptible to miscounting due to
concurrent tree modifications and task migrations too (scanning may not
converge under frequent cgroup layout modifications, and migrating tasks
may be summed 0 or >1 times). While an in-kernel implementation may assure
the snapshot view, it'd come at a cost. (Read: since the mechanism isn't
precise anyway, I don't suggest a fully synchronized scanning.)
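
The 0 or >1 times case can be shown with a toy model of an unsynchronized
scan. This is a generic sketch of the hazard only and makes no claim about
what the RFC code does; the group layout, task names and the scan()
helper are invented for the example.

```python
# Toy model of a non-atomic scan; all names here are hypothetical.

def scan(groups, order, migrate_after=None, migration=None):
    """Sum per-task usage group by group, optionally migrating a task mid-scan.

    groups:        dict of group name -> dict of task name -> usage
    order:         order in which groups are visited
    migrate_after: index in `order` after which the migration happens
    migration:     (task, source group, destination group)
    """
    total = 0
    for i, name in enumerate(order):
        total += sum(groups[name].values())
        if i == migrate_after and migration is not None:
            task, src, dst = migration
            groups[dst][task] = groups[src].pop(task)
    return total


def fresh():
    """Build a fresh layout so that each scan starts from the same state."""
    return {"a": {"t1": 10}, "b": {"t2": 5}}


# No migration: every task is counted exactly once.
true_total = scan(fresh(), ["a", "b"])

# t1 moves from an already visited group into one not yet visited: counted twice.
double = scan(fresh(), ["a", "b"], migrate_after=0, migration=("t1", "a", "b"))

# t2 moves from a not yet visited group into one already visited: never counted.
missed = scan(fresh(), ["a", "b"], migrate_after=0, migration=("t2", "b", "a"))
```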
It is true that scanning may not converge in my _current implementation_,
for instance if clients are constantly coming and going. There I took
the shortcut of not bothering to accumulate usage on process/client
exit, and I just wait for two stable periods before looking at the
current state. I reckon this is possibly okay for the real world.
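
One possible reading of "wait for two stable periods", as a sketch: the
set of clients is sampled once per period and the state is only acted
upon once that set has been unchanged for two consecutive samples. The
class and its names are hypothetical and the actual i915/RFC logic may
well differ in detail.

```python
# Hypothetical sketch of a "stable for N periods" gate; not the RFC code.

class StableScanner:
    def __init__(self, required=2):
        self.required = required  # consecutive identical samples needed
        self.last = None          # client set seen in the previous period
        self.stable = 0           # how many periods in a row it was unchanged

    def observe(self, clients):
        """Record one period's client set; return True once it is stable."""
        snapshot = frozenset(clients)
        if snapshot == self.last:
            self.stable += 1
        else:
            self.last = snapshot
            self.stable = 1
        return self.stable >= self.required


scanner = StableScanner()
first = scanner.observe({1, 2})    # first sighting of this set
second = scanner.observe({1, 2})   # same set again
churned = scanner.observe({1, 3})  # a client left and another arrived
```

With constant churn observe() never returns True, which is the
non-convergence case mentioned above.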
Cgroup tree hierarchy modifications can also be the reason for not
converging, but I thought I could hand-wave that away as not a realistic
scenario. Perhaps I am not imaginative enough?
I don't think under- or over-accounting of migrating tasks can happen,
since I am handling that explicitly.
Regards,
Tvrtko