From: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
Date: Mon, 20 Mar 2023 15:37:25 +0100
> We've seen recent AWS EKS (Kubernetes) user reports like the following:
>
>   After upgrading EKS nodes from v20230203 to v20230217 on our 1.24 EKS
>   clusters after a few days a number of the nodes have containers stuck
>   in ContainerCreating state or liveness/readiness probes reporting the
>   following error:
>
>     Readiness probe errored: rpc error: code = Unknown desc = failed to
>     exec in container: failed to start exec "4a11039f730203ffc003b7[...]":
>     OCI runtime exec failed: exec failed: unable to start container process:
>     unable to init seccomp: error loading seccomp filter into kernel:
>     error loading seccomp filter: errno 524: unknown
>
>   However, we had not been seeing this issue on previous AMIs and it only
>   started to occur on v20230217 (following the upgrade from kernel 5.4 to
>   5.10) with no other changes to the underlying cluster or workloads.
>
>   We tried the suggestions from that issue (sysctl net.core.bpf_jit_limit=452534528)
>   which helped to immediately allow containers to be created and probes to
>   execute but after approximately a day the issue returned and the value
>   returned by cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}'
>   was steadily increasing.
>
> I tested bpf tree to observe bpf_jit_charge_modmem, bpf_jit_uncharge_modmem
> their sizes passed in as well as bpf_jit_current under tcpdump BPF filter,
> seccomp BPF and native (e)BPF programs, and the behavior all looks sane
> and expected, that is nothing "leaking" from an upstream perspective.
>
> The bpf_jit_limit knob was originally added in order to avoid a situation
> where unprivileged applications loading BPF programs (e.g. seccomp BPF
> policies) consuming all the module memory space via BPF JIT such that loading
> of kernel modules would be prevented. The default limit was defined back in
> 2018 and while good enough back then, we are generally seeing far more BPF
> consumers today.
>
> Adjust the limit for the BPF JIT pool from originally 1/4 to now 1/2 of the
> module memory space to better reflect today's needs and avoid more users
> running into potentially hard to debug issues.
>
> Fixes: fdadd04931c2 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K")
> Reported-by: Stephen Haynes <sh@xxxxxxxx>
> Reported-by: Lefteris Alexakis <lefteris.alexakis@xxxxxxx>
> Signed-off-by: Daniel Borkmann <daniel@xxxxxxxxxxxxx>

Hi Daniel,

Thanks for the patch.

Reviewed-by: Kuniyuki Iwashima <kuniyu@xxxxxxxxxx>

> Link: https://github.com/awslabs/amazon-eks-ami/issues/1179
> Link: https://github.com/awslabs/amazon-eks-ami/issues/1219

I'm investigating these issues with the EKS folks. For issue 1179 the
customer was using our 5.4 kernel, and for 1219 our 5.10 kernel.

Then I found that my memleak fix, commit a1140cb215fa ("seccomp: Move
copy_seccomp() to no failure path."), was not backported to the upstream
5.10 stable trees. We'll test whether the issue can be reproduced with and
without that fix.

Anyway, I'll backport this patch to all of our trees.
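For anyone else chasing the vmallocinfo numbers, the accounting this knob
gates boils down to roughly the following pair of helpers (a simplified
sketch of current upstream kernel/bpf/core.c; the 5.4/5.10 stable trees
differ in details such as the unit of the charge and the capability check,
so treat it as illustrative only):

	static atomic_long_t bpf_jit_current;

	/* Charge a JIT allocation against the global pool. Privileged
	 * callers may exceed bpf_jit_limit; unprivileged ones (e.g.
	 * seccomp filters loaded from containers) get an error once
	 * bpf_jit_current would go past the limit.
	 */
	int bpf_jit_charge_modmem(u32 size)
	{
		if (atomic_long_add_return(size, &bpf_jit_current) >
		    READ_ONCE(bpf_jit_limit)) {
			if (!bpf_capable()) {
				atomic_long_sub(size, &bpf_jit_current);
				return -EPERM;
			}
		}
		return 0;
	}

	/* Every successful charge must be paired with an uncharge when
	 * the JIT image is freed; a leaked program means a charge that
	 * is never returned, which is what a steadily growing bpf_jit
	 * total in /proc/vmallocinfo would suggest.
	 */
	void bpf_jit_uncharge_modmem(u32 size)
	{
		atomic_long_sub(size, &bpf_jit_current);
	}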
Thanks,
Kuniyuki

> ---
>  kernel/bpf/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index b297e9f60ca1..e2d256c82072 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -972,7 +972,7 @@ static int __init bpf_jit_charge_init(void)
>  {
>  	/* Only used as heuristic here to derive limit. */
>  	bpf_jit_limit_max = bpf_jit_alloc_exec_limit();
> -	bpf_jit_limit = min_t(u64, round_up(bpf_jit_limit_max >> 2,
> +	bpf_jit_limit = min_t(u64, round_up(bpf_jit_limit_max >> 1,
>  				    PAGE_SIZE), LONG_MAX);
>  	return 0;
>  }
> --
> 2.27.0
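As a back-of-the-envelope illustration of what the one-liner changes
(hypothetical figure, not measured on the affected nodes): with
bpf_jit_alloc_exec_limit() returning 1 GiB of executable space,

	old default: round_up((1 GiB >> 2), PAGE_SIZE) = 256 MiB
	new default: round_up((1 GiB >> 1), PAGE_SIZE) = 512 MiB

and in both cases the value can still be raised at runtime via the
net.core.bpf_jit_limit sysctl, as the reporters did.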