Re: [RFC 00/17] DRM scheduling cgroup controller

Tejun Heo <tj@xxxxxxxxxx> · Mon, 31 Oct 2022 10:20:18 -1000

Hello,

On Thu, Oct 27, 2022 at 03:32:00PM +0100, Tvrtko Ursulin wrote:
> Looking at what's available in cgroups world now, I have spotted the
> blkio.prio.class control. For my current use case (lower GPU scheduling of
> background/unfocused windows) that would also work. Even if starting with
> just two possible values - 'no-change' and 'idle' (to follow the IO
> controller naming).

I wouldn't follow that example. That's only meaningful within the context of
bfq and it probably shouldn't have been merged in the first place.

> How would you view that as a proposal? It would be a bit tougher "sell" to
> the DRM community, perhaps, given that many drivers do have scheduling
> priority, but the concept of scheduling class is not really there.
> Nevertheless a concept of normal-vs-background could be plausible in my
> mind. It could be easily implemented by using the priority controls
> available in many drivers.

I don't feel great about that.

* The semantics aren't clearly defined. While not immediately obvious in the
  interface, the task nice levels have definite mappings to weight values
  and thus clear meanings. I don't think it's a good idea to leave the
  interface semantics open to interpretation.

* Maybe GPUs are better but my experience with optional hardware features in
  the storage world has been that vendors diverge wildly and unexpectedly to
  the point many features are mostly useless. There are fewer GPU vendors
  and more software effort behind each, so maybe the situation is better but
  I think it'd be helpful to keep some skepticism.

* Even when per-vendor or per-driver features are consistent enough to be
  usable, they often aren't thought through well enough to be truly useful.
  e.g. nvme has priority features, but they can't do much without congestion
  control on the issuer side. Once you have congestion control on the issuer
  side, which is usually a lot more complex (e.g. dealing with
  work-conserving hierarchical weight distributions, priority inversions and
  so on), you can achieve most of what you need in terms of resource control
  from the issuer side anyway.
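
To illustrate the first point, here is a rough sketch of what "nice levels
map to weights" means. It is an approximation, not kernel code: the scheduler
uses a precomputed table (sched_prio_to_weight) where nice 0 is 1024 and each
step changes the weight by about 1.25x. The helper names approx_weight() and
cpu_share() are made up for this example.

```python
# Approximate nice-to-weight mapping: nice 0 is 1024, and each nice step
# scales the weight by roughly 1.25x. The kernel's table values are rounded
# by hand, so they differ slightly from this formula (assumption: the
# closed form is only an illustration).
NICE_0_WEIGHT = 1024


def approx_weight(nice: int) -> float:
    """Return the approximate scheduler weight for a nice level."""
    if not -20 <= nice <= 19:
        raise ValueError("nice must be in [-20, 19]")
    return NICE_0_WEIGHT / (1.25 ** nice)


def cpu_share(nice: int, others: list[int]) -> float:
    """Fraction of CPU a task at `nice` gets against always-runnable `others`."""
    w = approx_weight(nice)
    return w / (w + sum(approx_weight(n) for n in others))
```

With this, a nice 1 task competing with a single nice 0 task ends up with
about 44% of the CPU, which is the kind of concrete meaning a bare
'no-change' / 'idle' class doesn't give you.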

So, I'd much prefer to have a fuller solution on the kernel side which
integrates per-vendor/driver features where they make sense.

> > >    drm.budget_supported
> > > 	One of:
> > > 	 1) 'yes' - when all DRM clients in the group support the functionality.
> > > 	 2) 'no' - when at least one of the DRM clients does not support the
> > > 		   functionality.
> > > 	 3) 'n/a' - when there are no DRM clients in the group.
> > 
> > Yeah, I'm not sure about this. This isn't a per-cgroup property to begin
> > with and I'm not sure 'no' meaning at least one device not supporting is
> > intuitive. The distinction between 'no' and 'n/a' is kinda weird too. Please
> > drop this.
> 
> The idea actually is for this to be per-cgroup and potentially change
> dynamically. It implements the concept of "observability", how I have,
> perhaps clumsily, named it.
> 
> This is because we can have a mix of DRM file descriptors in a cgroup, not
> all of which support the proposed functionality. So I wanted to have
> something by which the administrator can observe the status of the group.
> 
> For instance seeing some clients do not support the feature could be signal
> that things have been misconfigured, or that appeal needs to be made for
> driver X to start supporting the feature. Seeing a "no" there in other words
> is a signal that budgeting may not really work as expected and needs to be
> investigated.

I still don't see how this is per-cgroup given that it's indicating whether
the driver supports some feature. Also, the eventual goal would be
supporting the same control mechanisms across most (if not all) GPUs, right?

> > Rather than doing it hierarchically on the spot, it's usually a lot cheaper
> > and easier to calculate the flattened hierarchical weight per leaf cgroup
> > and divide the bandwidth according to the eventual portions. For an example,
> > please take a look at block/blk-iocost.c.
> 
> This seems exactly what I had in mind (but haven't implemented it yet). So
> in this RFC I have budget splitting per group where each tree level adds up
> to "100%" and the thing which I have not implemented is "borrowing" or
> yielding (how blk-iocost.c calls it, or donating) unused budgets to
> siblings.
> 
> I am very happy to hear my idea is the right one and someone already
> implemented it. Thanks for this pointer!

The budget donation thing in iocost is necessary only because it wants to
make the hot path local to the cgroup, as io control has to support a very
high decision rate. For time-slicing GPUs, it's likely that following the
current hierarchical weight on the spot is enough.

Thanks.

-- 
tejun