On Tue, Nov 3, 2020 at 12:43 AM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
> On Mon, Nov 2, 2020 at 9:39 PM Kenny Ho <y2kenny@xxxxxxxxx> wrote:
>
> pls don't top post.

My apologies.

> > Cgroup awareness is desired because the intent
> > is to use this for resource management as well (potentially along with
> > other cgroup controlled resources.) I will dig into bpf_lsm and learn
> > more about it.
>
> Also consider that bpf_lsm hooks have a way to get cgroup-id without
> being explicitly scoped. So the bpf program can be made cgroup aware.
> It's just not as convenient as attaching a prog to cgroup+hook at once.
> For prototyping the existing bpf_lsm facility should be enough.
> So please try to follow this route and please share more details about
> the use case.

Ok, I will take a look and see if that is sufficient. My understanding of
bpf-cgroup is that it not only makes attaching a prog to a cgroup easier,
it also handles the hierarchical invocation of the attached progs, which
may be useful if users want to manage GPU resources with bpf-cgroup
alongside other cgroup-controlled resources (cpu/mem/io, etc.)

About the use case: the high-level motivation is to provide the ability
to subdivide/share a GPU via cgroups/containers, similar to other
resources like CPU and memory. Users have been requesting this type of
functionality because GPU compute can be expensive, and they want to
maximize utilization to get the most bang for their buck. The traditional
way to do this is via SR-IOV/virtualization, but that usually means
time-sharing the GPU as a whole unit, which is useful for some
applications but not others because of the flushing and added latency.
We also have a study that identified various GPU compute application
types that can benefit from more asymmetrical/granular sharing of the
GPU (for example, some applications are compute bound while others are
memory bound and would benefit from having more VRAM.)

I have been trying to add a cgroup controller for the drm subsystem for
this purpose, but I ran into two challenges. First, the composition of a
GPU, and how subcomponents such as VRAM or shader engines/compute units
can be shared, are very much vendor specific, so we have been unable to
arrive at a common interface across all vendors. Second, partly because
of this and partly because of the variety of places a GPU can be deployed
(smartphone, PC, server, HPC), there is no agreement on how exactly a GPU
should be shared. The best way forward appears to be to simply provide
hooks for users to define how and what they want to share via a bpf
program.

From what I can tell so far (I am still learning), there are multiple
pieces that need to fall into place for bpf-cgroup to work for this use
case. First there is resource limit enforcement, which is the motivation
for this RFC (I will look into bpf_lsm as the path forward.) I have also
been thinking about instrumenting the drm subsystem with a new BPF
program type that has various attach types across the drm subsystem, but
I am not sure if that is allowed (this one is more for resource usage
monitoring.) Another thing I have been considering is to have the gpu
driver provide bpf helper functions so that bpf programs can modify drm
driver internals. That was the reason I asked about the potential of BTF
support for kernel modules a couple of months ago (and Andrii Nakryiko
mentioned that it is being worked on.)

Please feel free to ask more questions if any of the above is unclear.
Feedback is always welcome.

Regards,
Kenny
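
P.S. To make the bpf_lsm route concrete, here is a minimal, untested
sketch of the kind of cgroup-aware check I have in mind. The program
name, the choice of the file_open hook, and the single allowed_cgroup_id
knob are placeholders for illustration; a real policy would more likely
consult a map keyed by cgroup id.

#include "vmlinux.h"
#include <errno.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Char major of drm device nodes (/dev/dri/*). */
#define DRM_MAJOR 226

/* Placeholder knob: the one cgroup allowed to open drm devices.
 * User space sets this (e.g. via the skeleton's rodata) before load. */
const volatile u64 allowed_cgroup_id;

char LICENSE[] SEC("license") = "GPL";

SEC("lsm/file_open")
int BPF_PROG(drm_open_check, struct file *file)
{
	u32 rdev = file->f_inode->i_rdev;

	/* Only police drm device nodes; let every other open through.
	 * The kernel dev_t encodes the major in the top 12 bits. */
	if ((rdev >> 20) != DRM_MAJOR)
		return 0;

	/* This is the part Alexei pointed at: the program becomes
	 * cgroup aware via the current task's cgroup id, without
	 * being attached to a specific cgroup. */
	if (bpf_get_current_cgroup_id() == allowed_cgroup_id)
		return 0;

	return -EPERM;
}

User space would load this with libbpf and attach it with
bpf_program__attach_lsm(); the kernel needs CONFIG_BPF_LSM=y and "bpf"
in the CONFIG_LSM list for the hook to fire. If this pattern holds up,
the same cgroup-id trick should extend to whatever drm-specific hooks
come out of the discussion above.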