Hello,

On Mon, Oct 29, 2018 at 07:49:13PM -0400, Kenny Ho wrote:
> Unlike a computer, however, GPUs are not as well "standardized" as a
> desktop or a server. For the gpu/drm subsystem, there are something
> that are common (such as buffer sharing and buffer lifetime
> management), something that are shared by some vendors (software
> scheduler) and something that are very much vendor specific. Due to
> this, a generic cgroup controller for drm may need to be more
> pluggable than other cgroup controller. We took a look at the rdma
> cgroup as part of our research but rdma appears to have resources
> that are more abstracted and standardized.
>
> What do you think? Does drm/gpu warrant its own full-fledged cgroup
> controller?

First of all, the summary is much appreciated.  Here are my two cents.

I think it could help a lot to think about what features users would
eventually want instead of specific hardware details.  The hardware
might not be all that standardized, but what users want in terms of
resource control doesn't vary much - e.g. "this is more important,
that's less, but I don't want to leave the device idle while there's
work to do" or "this guy paid me X while that guy paid Y, let's make
sure each gets what they paid for".

Rather than trying to build the interface up from what each device can
do, building it down from high-level user needs has, I believe, a
better chance of reaching an interface which a wide audience would
find useful and which can stand the test of time.  IOW, make the
interface about user intentions rather than underlying implementation
details.  In the long term, this also helps us (kernel devs) because
implementation details don't get locked into a widely used interface.

Also, I wouldn't recommend using rdma as the benchmark.  While rdma
has abstract standard resources defined and the controller distributes
numeric amounts of them, what those amounts mean to users is poorly
defined, unintuitive and difficult to use.  That was what rdma could
do given the circumstances of the area (it's really difficult to
define a work or cost metric for IO devices), but I think gpus have a
much better chance of reaching something which is a lot more
meaningful and useful.

As the first (rather challenging) step, what's likely to be the most
useful to the widest audience is work-conserving proportional control
- gpu.weight.  The implementation strategy can differ across gpu
vendors but the concept is as universal as it gets - A should be able
to do X times more work than B.

Thanks.

-- 
tejun
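
P.S. To make the intent concrete, here's a rough userspace sketch of
how such a knob could be used.  This is purely hypothetical -
gpu.weight doesn't exist, and I'm only assuming it would behave like
cpu.weight does in cgroup v2:

	/*
	 * Hypothetical example: give cgroup A twice the GPU share of
	 * cgroup B by writing to a (not yet existing) gpu.weight file,
	 * the same way cpu.weight is configured in cgroup v2.  The
	 * cgroup paths and the knob name are assumptions for
	 * illustration only.
	 */
	#include <stdio.h>

	static int set_gpu_weight(const char *cgroup, int weight)
	{
		char path[256];
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/fs/cgroup/%s/gpu.weight", cgroup);
		f = fopen(path, "w");
		if (!f)
			return -1;
		fprintf(f, "%d\n", weight);
		return fclose(f);
	}

	int main(void)
	{
		/*
		 * When both A and B have work queued, A should get
		 * roughly twice the device time of B.  When only one
		 * is active, it can consume the whole device - that's
		 * the work-conserving part.
		 */
		if (set_gpu_weight("A", 200) || set_gpu_weight("B", 100))
			perror("gpu.weight");
		return 0;
	}

The point is that the knob expresses relative importance, not
device-specific absolute quantities, so each vendor stays free to
implement the enforcement however its hardware allows.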