On Wed, Aug 10, 2022 at 11:13 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
>
> In our production environment, we may load, run and pin bpf programs and
> maps in containers. For example, some of our networking bpf programs and
> maps are loaded and pinned by a process running in a container in our
> k8s environment. This container also runs other user applications, which
> watch the networking configurations from remote servers and apply them
> on the local host, log error events, monitor the traffic, and so on.
> Sometimes we need to update these user applications to a new release,
> and during the update we destroy the old container and then start a new
> generation. In order not to interrupt the bpf programs during the
> update, we pin the bpf programs and maps in bpffs. That is the
> background and use case in our production environment.
>
> After switching to memcg-based bpf memory accounting to limit the bpf
> memory, we ran into some unexpected issues:
> 1. The memory usage is not consistent between the first generation and
>    new generations.
> 2. After the first generation is destroyed, the bpf memory can't be
>    limited if the bpf maps are not preallocated, because they will be
>    reparented.
>
> This patchset tries to resolve these issues by introducing an
> independent memcg to limit the bpf memory.
>
> At bpf map creation, we can assign a specific memcg instead of using
> the current memcg. That makes it flexible in containerized environments.
> For example, if we want to limit the pinned bpf maps, we can use the
> hierarchy below:
>
>     Shared resources                 Private resources
>
>         bpf-memcg                        k8s-memcg
>         /       \                        /
>  bpf-bar-memcg  bpf-foo-memcg      srv-foo-memcg
>                     |               /        \
>                 (charged)    (not charged)  (charged)
>                     |             /            \
>                     |            /              \
>            bpf-foo-{progs, maps}              srv-foo
>
> srv-foo loads and pins bpf-foo-{progs, maps}, but they are charged to an
> independent memcg (bpf-foo-memcg) instead of srv-foo's memcg
> (srv-foo-memcg).
>
> Please note that there may be no process in bpf-foo-memcg, which means
> it can currently be rmdir-ed by the root user. Meanwhile, we don't
> forcefully destroy a memcg if it doesn't have any residents, so this
> hierarchy is acceptable.
>
> In order to make the memcg of bpf maps selectable, this patchset
> introduces some memory allocation wrappers to allocate map-related
> memory. These wrappers get the memcg from the map and then charge the
> allocated pages or objects to it.
>
> Currently it only supports bpf maps, but we can extend it to bpf progs
> as well. It only supports cgroup2 now, but we can make an additional
> change in cgroup_get_from_fd() to support cgroup1.
>
> Observability can also be supported as a next step, for example,
> showing the bpf map's memcg with 'bpftool map show', or even showing
> which maps are charged to a specific memcg with 'bpftool cgroup show'.
> Furthermore, we may also show the accurate memory size of a bpf map
> instead of an estimated size in 'bpftool map show' in the future.
>
> RFC->v1:
> - get rid of bpf_map container wrapper (Alexei)
> - add the new field into the end of struct (Alexei)
> - get rid of BPF_F_SELECTABLE_MEMCG (Alexei)
> - save memcg in bpf_map_init_from_attr
> - introduce bpf_ringbuf_pages_{alloc,free} and keep them inside
>   kernel/bpf/ringbuf.c (Andrii)
>
> Yafang Shao (15):
>   bpf: Remove unneeded memset in queue_stack_map creation
>   bpf: Use bpf_map_area_free instread of kvfree
>   bpf: Make __GFP_NOWARN consistent in bpf map creation
>   bpf: Use bpf_map_area_alloc consistently on bpf map creation
>   bpf: Fix incorrect mem_cgroup_put
>   bpf: Define bpf_map_{get,put}_memcg for !CONFIG_MEMCG_KMEM
>   bpf: Call bpf_map_init_from_attr() immediately after map creation
>   bpf: Save memcg in bpf_map_init_from_attr()
>   bpf: Use scoped-based charge in bpf_map_area_alloc
>   bpf: Introduce new helpers bpf_ringbuf_pages_{alloc,free}
>   bpf: Use bpf_map_kzalloc in arraymap
>   bpf: Use bpf_map_kvcalloc in bpf_local_storage
>   mm, memcg: Add new helper get_obj_cgroup_from_cgroup
>   bpf: Add return value for bpf_map_init_from_attr
>   bpf: Introduce selectable memcg for bpf map
>
>  include/linux/bpf.h            |  43 ++++++++++++-
>  include/linux/memcontrol.h     |  11 ++++
>  include/uapi/linux/bpf.h       |   1 +
>  kernel/bpf/arraymap.c          |  34 ++++++-----
>  kernel/bpf/bloom_filter.c      |  11 +++-
>  kernel/bpf/bpf_local_storage.c |  17 ++++--
>  kernel/bpf/bpf_struct_ops.c    |  19 +++---
>  kernel/bpf/cpumap.c            |  17 ++++--
>  kernel/bpf/devmap.c            |  30 ++++++----
>  kernel/bpf/hashtab.c           |  26 ++++----
>  kernel/bpf/local_storage.c     |  12 ++--
>  kernel/bpf/lpm_trie.c          |  12 +++-
>  kernel/bpf/offload.c           |  12 ++--
>  kernel/bpf/queue_stack_maps.c  |  13 ++--
>  kernel/bpf/reuseport_array.c   |  11 +++-
>  kernel/bpf/ringbuf.c           | 104 ++++++++++++++++++++++----------
>  kernel/bpf/stackmap.c          |  13 ++--
>  kernel/bpf/syscall.c           | 133 ++++++++++++++++++++++++++++-------------
>  mm/memcontrol.c                |  41 +++++++++++++
>  net/core/sock_map.c            |  30 ++++++----
>  net/xdp/xskmap.c               |  12 +++-
>  tools/include/uapi/linux/bpf.h |   1 +
>  tools/lib/bpf/bpf.c            |   3 +-
>  tools/lib/bpf/bpf.h            |   3 +-
>  tools/lib/bpf/gen_loader.c     |   2 +-
>  tools/lib/bpf/libbpf.c         |   2 +
>  tools/lib/bpf/skel_internal.h  |   2 +-
>  27 files changed, 436 insertions(+), 179 deletions(-)
>
> --
> 1.8.3.1
>

Ah, this series is incomplete. Please see the updated one:
https://lore.kernel.org/bpf/20220810151840.16394-1-laoar.shao@xxxxxxxxx/T/#t

--
Regards
Yafang