Hello Xie!

It's actually not a surprise, it's a known limitation/exception.
Partially it was so because historically there was no way to account
percpu memory, and some bpf maps use it quite extensively.

Fortunately, it changed recently, and 5.9 will likely get an ability
to account percpu memory. I actually sent the latest version of the
patchset today:
https://lore.kernel.org/linux-mm/20200623184515.4132564-1-guro@xxxxxx/T/#m0be45dd71e6a238985181c213d9934731949c089

I also have a patchset in the works which adds memcg accounting of bpf
memory (programs and maps). I plan to send it upstream next week.
If everything goes smoothly, it might appear in 5.9 as well.

Unfortunately, the magnitude of the required changes does not allow
backporting them to older kernels.

Thanks!

PS I'll be completely offline till the end of the week. I'll respond
to all e-mail on Monday, Jun 29th.

Thanks!

On Wed, Jun 24, 2020 at 03:46:58AM +0000, Xie Xun wrote:
> Hello,
>
> I found that programs can consume much more memory than memcg limit by setting BPF for many times. It's because that allocations during setting BPF are not charged by memcg.
>
>
> Below is how I did it:
>
> 1. Run Linux kernel in a QEMU virtual machine (x86_64) with 1GB physical memory.
>    The kernel is built with memcg and memcg kmem accounting enabled.
>
> 2. Create a docker (runC) container, with memory limit 100MB.
>
>      docker run --name debian --memory 100000000 --kernel-memory 50000000 \
>          debian:slim /bin/bash
>
> 3. In the container, run a program to set BPF for many times. I use prctl to set BPF.
>
>      while(1)
>      {
>          prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bpf);
>      }
>
> 4. Physical memory usage(the one by `free` or `top`) is increased by around 40MB,
>    but memory usage of the container's memcg doesn't increase a lot (around 100KB).
>
> 5. Run several processes to set BPF, and almost all physical memory is consumed.
>    Sometimes some processes not in the container are also killed due to OOM.
>
> I also try this with user namespace on, and I can still kill host processes inside container in this way. So this problem may be dangerous for containers that based on cgroups.
>
>
> kernel version: 5.3.6
> kernel configuration: in attachment (CONFIG_MEMCG_KMEM is on)
>
>
> This blog also shows this problem: https://blog.xiexun.tech/break-memcg.html
>
>
> Cause of this problem:
>
> Memory allocations during setting BPF are not charged by memcg. For example,
> in kernel/bpf/core.c:bpf_prog_alloc, bpf_prog_alloc_no_stats and alloc_percpu_gfp
> are called to allocate memory. However, neither of them are charged by memcg.
> So if we trigger this path for many times, we can consume lots of memory, without
> increasing our memcg usage.
>
> /* ------------ */
> struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags)
> {
>         gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | gfp_extra_flags;
>         struct bpf_prog *prog;
>         int cpu;
>
>         prog = bpf_prog_alloc_no_stats(size, gfp_extra_flags);
>         if (!prog)
>                 return NULL;
>
>         prog->aux->stats = alloc_percpu_gfp(struct bpf_prog_stats, gfp_flags);
>
>         /* ... */
>
> }
> /* ------------ */
>
>
> My program that sets BPF:
>
> /* ------------ */
> #include <unistd.h>
> #include <sys/prctl.h>
> #include <linux/prctl.h>
> #include <linux/seccomp.h>
> #include <linux/filter.h>
> #include <linux/audit.h>
> #include <linux/signal.h>
> #include <sys/ptrace.h>
> #include <stdio.h>
> #include <errno.h>
>
> int main()
> {
>     struct sock_filter insns[] =
>     {
>         {
>             .code = 0x6,
>             .jt = 0,
>             .jf = 0,
>             .k = SECCOMP_RET_ALLOW
>         }
>     };
>     struct sock_fprog bpf =
>     {
>         .len = 1,
>         .filter = insns
>     };
>     int ret;
>
>     ret = prctl(PR_SET_NO_NEW_PRIVS, 1, NULL, 0, 0);
>     if (ret)
>     {
>         printf("error1 %d\n", errno);
>         return 1;
>     }
>     int count = 0;
>     while (1)
>     {
>         ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bpf);
>         if (ret)
>         {
>             sleep(1);
>             printf("error %d\n", errno);
>         }
>         else
>         {
>             count++;
>             printf("ok %d\n", count);
>         }
>     }
>     return 0;
> }
> /* ------------ */
>
>
> Thanks,
> Xie Xun
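
For anyone reproducing step 4 of the report, one way to watch the gap between system-wide memory and what the memcg is charged is to sample both from inside the container. The following is only a minimal sketch and is not part of the original report. It assumes a cgroup v1 layout with the memory controller visible at /sys/fs/cgroup/memory; adjust the paths if your setup differs. The helper names meminfo_kb, read_file and read_counter are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Find a "Key:   12345 kB" line in /proc/meminfo-style text.
 * Returns the numeric value (in kB), or -1 if the key is absent. */
long long meminfo_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        /* Match the key only at the start of a line, followed by ':'. */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':')
            return strtoll(line + klen + 1, NULL, 10);
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}

/* Read a small file into buf and NUL-terminate it.
 * Returns 0 on success, -1 if the file cannot be opened. */
int read_file(const char *path, char *buf, size_t size)
{
    FILE *f;
    size_t n;

    assert(buf != NULL && size > 0);
    f = fopen(path, "r");
    if (!f)
        return -1;
    n = fread(buf, 1, size - 1, f);
    buf[n] = '\0';
    fclose(f);
    return 0;
}

/* Read a file holding a single decimal counter, such as the
 * (assumed) cgroup v1 file memory.usage_in_bytes. -1 on error. */
long long read_counter(const char *path)
{
    char buf[64];

    if (read_file(path, buf, sizeof(buf)) != 0)
        return -1;
    return strtoll(buf, NULL, 10);
}
```

To use it, read /proc/meminfo with read_file() into a buffer of a few KB and pass it to meminfo_kb() with the key "MemAvailable", then call read_counter() on /sys/fs/cgroup/memory/memory.usage_in_bytes and /sys/fs/cgroup/memory/memory.kmem.usage_in_bytes. Sampling these before and after running the prctl loop should show the first value dropping while the two memcg counters barely move, if the behaviour described in the report holds on your kernel.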