On Fri, May 10, 2019 at 8:31 AM Christian König <ckoenig.leichtzumerken@xxxxxxxxx> wrote:
>
> I think it is a good approach to try to add a global limit first and
> when that's working go ahead with limiting device specific resources.

What are some of the global DRM resource limits/allocations that would
be useful to implement?  I would be happy to dig into those.

Regards,
Kenny

> The only major issue I can see is on patch #4, see there for further
> details.
>
> Christian.
>
> On 09.05.19 at 23:04, Kenny Ho wrote:
> > This is a follow-up to the RFC I made last November to introduce a cgroup
> > controller for the GPU/DRM subsystem [a]. The goal is to be able to provide
> > resource management for GPU resources using things like containers. The
> > cover letter from v1 is copied below for reference.
> >
> > Usage examples:
> > // set limit for card1 to 1GB
> > sed -i '2s/.*/1073741824/' /sys/fs/cgroup/<cgroup>/drm.buffer.total.max
> >
> > // set limit for card0 to 512MB
> > sed -i '1s/.*/536870912/' /sys/fs/cgroup/<cgroup>/drm.buffer.total.max
> >
> >
> > v2:
> > * Removed the vendoring concepts
> > * Add limit to total buffer allocation
> > * Add limit to the maximum size of a buffer allocation
> >
> > TODO: process migration
> > TODO: documentation
> >
> > [a]: https://lists.freedesktop.org/archives/dri-devel/2018-November/197106.html
> >
> > v1: cover letter
> >
> > The purpose of this patch series is to start a discussion for a generic cgroup
> > controller for the drm subsystem. The design proposed here is a very early one.
> > We are hoping to engage the community as we develop the idea.
> >
> >
> > Background
> > ==========
> > Control Groups/cgroup provide a mechanism for aggregating/partitioning sets of
> > tasks, and all their future children, into hierarchical groups with specialized
> > behaviour, such as accounting/limiting the resources which processes in a cgroup
> > can access[1].
> > Weights, limits, protections, allocations are the main resource
> > distribution models. Existing cgroup controllers include cpu, memory, io,
> > rdma, and more. cgroup is one of the foundational technologies that enables the
> > popular container application deployment and management method.
> >
> > Direct Rendering Manager/drm contains code intended to support the needs of
> > complex graphics devices. Graphics drivers in the kernel may make use of DRM
> > functions to make tasks like memory management, interrupt handling and DMA
> > easier, and provide a uniform interface to applications. The DRM has also
> > developed beyond traditional graphics applications to support compute/GPGPU
> > applications.
> >
> >
> > Motivations
> > =========
> > As GPUs grow beyond the realm of desktop/workstation graphics into areas like
> > data center clusters and IoT, there is an increasing need to monitor and
> > regulate the GPU as a resource like cpu, memory and io.
> >
> > Matt Roper from Intel began working on a similar idea in early 2018 [2] for the
> > purpose of managing GPU priority using the cgroup hierarchy. While that
> > particular use case may not warrant a standalone drm cgroup controller, there
> > are other use cases where having one can be useful [3]. Monitoring GPU
> > resources such as VRAM and buffers, CU (compute unit [AMD's nomenclature])/EU
> > (execution unit [Intel's nomenclature]), GPU job scheduling [4] can help
> > sysadmins get a better understanding of the applications' usage profiles.
> > Further usage regulation of the aforementioned resources can also help
> > sysadmins optimize workload deployment on limited GPU resources.
> >
> > With the increased importance of machine learning, data science and other
> > cloud-based applications, GPUs are already in production use in data centers
> > today [5,6,7]. Existing GPU resource management is very coarse-grained, however,
> > as sysadmins are only able to distribute workloads on a per-GPU basis [8].
> > An alternative is to use GPU virtualization (with or without SRIOV) but it
> > generally acts on the entire GPU instead of the specific resources in a GPU.
> > With a drm cgroup controller, we can enable alternate, fine-grained, sub-GPU
> > resource management (in addition to what may be available via GPU
> > virtualization.)
> >
> > In addition to production use, the DRM cgroup can also help with testing
> > graphics application robustness by providing a means to artificially limit DRM
> > resources available to the applications.
> >
> > Challenges
> > ========
> > While there is common infrastructure in DRM that is shared across many vendors
> > (the scheduler [4] for example), there are also aspects of DRM that are vendor
> > specific. To accommodate this, we borrowed the mechanism used by the cgroup to
> > handle different kinds of cgroup controllers.
> >
> > Resources for DRM are also often device (GPU) specific instead of system
> > specific and a system may contain more than one GPU. For this, we borrowed some
> > of the ideas from the RDMA cgroup controller.
> >
> > Approach
> > =======
> > To experiment with the idea of a DRM cgroup, we would like to start with basic
> > accounting and statistics, then continue to iterate and add regulating
> > mechanisms into the driver.
> >
> > [1] https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt
> > [2] https://lists.freedesktop.org/archives/intel-gfx/2018-January/153156.html
> > [3] https://www.spinics.net/lists/cgroups/msg20720.html
> > [4] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler
> > [5] https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
> > [6] https://blog.openshift.com/gpu-accelerated-sql-queries-with-postgresql-pg-strom-in-openshift-3-10/
> > [7] https://github.com/RadeonOpenCompute/k8s-device-plugin
> > [8] https://github.com/kubernetes/kubernetes/issues/52757
> >
> > Kenny Ho (5):
> >    cgroup: Introduce cgroup for drm subsystem
> >    cgroup: Add mechanism to register DRM devices
> >    drm/amdgpu: Register AMD devices for DRM cgroup
> >    drm, cgroup: Add total GEM buffer allocation limit
> >    drm, cgroup: Add peak GEM buffer allocation limit
> >
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    |   4 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |   4 +
> >   drivers/gpu/drm/drm_gem.c                  |   7 +
> >   drivers/gpu/drm/drm_prime.c                |   9 +
> >   include/drm/drm_cgroup.h                   |  54 +++
> >   include/drm/drm_gem.h                      |  11 +
> >   include/linux/cgroup_drm.h                 |  47 ++
> >   include/linux/cgroup_subsys.h              |   4 +
> >   init/Kconfig                               |   5 +
> >   kernel/cgroup/Makefile                     |   1 +
> >   kernel/cgroup/drm.c                        | 497 +++++++++++++++++++++
> >   11 files changed, 643 insertions(+)
> >   create mode 100644 include/drm/drm_cgroup.h
> >   create mode 100644 include/linux/cgroup_drm.h
> >   create mode 100644 kernel/cgroup/drm.c
> >
>
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel