On Tue, Nov 03, 2020 at 02:19:22PM -0500, Kenny Ho wrote:
> On Tue, Nov 3, 2020 at 12:43 AM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> > On Mon, Nov 2, 2020 at 9:39 PM Kenny Ho <y2kenny@xxxxxxxxx> wrote:
> > pls don't top post.
>
> My apology.
>
> > > Cgroup awareness is desired because the intent is to use this for
> > > resource management as well (potentially along with other cgroup
> > > controlled resources.)  I will dig into bpf_lsm and learn more
> > > about it.
> >
> > Also consider that bpf_lsm hooks have a way to get the cgroup-id
> > without being explicitly scoped, so the bpf program can be made
> > cgroup aware.  It's just not as convenient as attaching a prog to
> > cgroup+hook at once.  For prototyping, the existing bpf_lsm facility
> > should be enough.  So please try to follow this route and please
> > share more details about the use case.
>
> Ok.  I will take a look and see if that is sufficient.  My
> understanding of bpf-cgroup is that it not only makes attaching a prog
> to a cgroup easier, it also facilitates hierarchical calling of the
> attached progs, which might be useful if users want to manage gpu
> resources with bpf cgroup along with other cgroup resources (like
> cpu/mem/io, etc.)

Right. Hierarchical cgroup-bpf logic cannot be replicated inside the
program. If you're relying on the cgv2 hierarchy to containerize
applications then what I suggested earlier won't work indeed.

> About the use case.  The high level motivation here is to provide the
> ability to subdivide/share a GPU via cgroups/containers in a way that
> is similar to other resources like CPU and memory.  Users have been
> requesting this type of functionality because GPU compute can get
> expensive and they want to maximize utilization to get the most bang
> for their buck.  A traditional way to do this is via
> SRIOV/virtualization, but that often means time sharing the GPU as a
> whole unit.  That is useful for some applications but not others, due
> to the flushing and added latency.  We also have a study that
> identified various GPU compute application types.  These types can
> benefit from more asymmetrical/granular sharing of the GPU (for
> example, some applications are compute bound while others are memory
> bound and can benefit from having more VRAM.)
>
> I have been trying to add a cgroup subsystem for the drm subsystem for
> this purpose, but I ran into two challenges.  First, the composition
> of a GPU and how some of the subcomponents (like VRAM or shader
> engines/compute units) can be shared are very much vendor specific, so
> we are unable to arrive at a common interface across all vendors.
> Because of this, and the variety of places a GPU can end up in
> (smartphone, PC, server, HPC), there is also no agreement on how
> exactly a GPU should be shared.  The best way forward appears to be to
> simply provide hooks for users to define how and what they want to
> share via a bpf program.

Thank you for sharing the details. It certainly helps.

> From what I can tell so far (I am still learning), there are multiple
> pieces that need to fall into place for bpf-cgroup to work for this
> use case.  First there is resource limit enforcement, which is the
> motivation for this RFC (I will look into bpf_lsm as the path
> forward.)  I have also been thinking about instrumenting the drm
> subsystem with a new BPF program type and having various attach types
> across the drm subsystem, but I am not sure if this is allowed (this
> one is more for resource usage monitoring.)
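
To make the bpf_lsm suggestion concrete, a first prototype could look
roughly like the sketch below. Completely untested; the map, the budget
semantics and the choice of the file_ioctl hook are placeholders for
whatever enforcement you actually need. The point is that
bpf_get_current_cgroup_id() keeps the policy per-cgroup even though the
prog is attached system-wide (needs CONFIG_BPF_LSM and the "bpf" LSM
enabled):

/* Untested sketch: per-cgroup "GPU ioctl budget" enforced from bpf_lsm.
 * The gpu_budget map and its semantics are made up for illustration;
 * userspace would populate it keyed by cgroup v2 id.
 */
#include "vmlinux.h"
#include <errno.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u64);		/* cgroup v2 id */
	__type(value, __u64);		/* remaining budget */
} gpu_budget SEC(".maps");

SEC("lsm/file_ioctl")
int BPF_PROG(gpu_ioctl_limit, struct file *file, unsigned int cmd,
	     unsigned long arg)
{
	/* Not attached to any cgroup: the prog scopes itself by looking
	 * up the current task's cgroup v2 id.
	 */
	__u64 cgid = bpf_get_current_cgroup_id();
	__u64 *budget = bpf_map_lookup_elem(&gpu_budget, &cgid);

	if (!budget)
		return 0;		/* no policy for this cgroup */
	if (*budget == 0)
		return -EPERM;		/* budget exhausted, deny */

	/* Note: as written this sees every ioctl in the system, not just
	 * GPU ones, which is part of why a drm-specific hook may be
	 * preferable.
	 */
	__sync_fetch_and_add(budget, -1);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";

libbpf can attach it with bpf_program__attach_lsm(), no cgroup fd
involved.
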
> Another thing I have been considering is to have the gpu driver
> provide bpf helper functions for bpf programs to modify drm driver
> internals.  That was the reason I asked about the potential of BTF
> support for kernel modules a couple of months ago (and Andrii Nakryiko
> mentioned that it is being worked on.)

Sounds like either bpf_lsm needs to be made aware of cgv2 (which would
be a great thing to have regardless) or cgroup-bpf needs a drm/gpu
specific hook. I think a generic ioctl hook is too broad for this use
case. I suspect drm/gpu internal state would be easier to access inside
the bpf program if the hook sits next to gpu/drm. At the ioctl level
there is 'file'. It's probably too abstract for the things you want to
do. Like how can VRAM/shader/etc be accessed through file? Probably
possible through a bunch of lookups and dereferences, but if the hook is
custom to the GPU that info is likely readily available. Then such a
cgroup-bpf check would also be suitable in execution paths where an
ioctl-based hook would be too slow.
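
To illustrate the "bunch of lookups and dereferences" point, reaching
drm state from the generic hook would look roughly like the untested
fragment below. It assumes the drm struct layouts are visible via BTF
(which is where the module BTF work you mentioned comes in), and
everything past drm_device sits behind dev_private, i.e. vendor
specific, which is exactly why a hook next to the driver looks more
attractive:

/* Untested fragment: walking from 'file' towards drm state in the
 * generic file_ioctl hook.  A real filter would compare file->f_op
 * against the drm fops instead of the crude ioctl-type check below.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

SEC("lsm/file_ioctl")
int BPF_PROG(drm_ioctl_peek, struct file *file, unsigned int cmd,
	     unsigned long arg)
{
	struct drm_file *fpriv;
	struct drm_device *ddev;

	/* DRM ioctls use the 'd' type (DRM_IOCTL_BASE). */
	if (((cmd >> 8) & 0xff) != 'd')
		return 0;

	/* file->private_data is the drm_file set up by drm_open() */
	fpriv = BPF_CORE_READ(file, private_data);
	ddev = BPF_CORE_READ(fpriv, minor, dev);

	bpf_printk("drm ioctl on minor %d", BPF_CORE_READ(ddev, primary, index));

	/* From here the interesting state (VRAM, CUs, ...) lives behind
	 * ddev->dev_private, which is vendor specific and opaque to a
	 * generic hook.  A gpu/drm-local hook could pass it in directly.
	 */
	return 0;
}

char LICENSE[] SEC("license") = "GPL";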