Re: bpf_jit_limit close shave

Daniel Borkmann <daniel@xxxxxxxxxxxxx> · Thu, 23 Sep 2021 13:52:05 +0200

On 9/23/21 11:16 AM, Lorenz Bauer wrote:
On Wed, 22 Sept 2021 at 22:51, Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
On 9/22/21 1:07 PM, Lorenz Bauer wrote:
On Wed, 22 Sept 2021 at 09:20, Frank Hofmann <fhofmann@xxxxxxxxxxxxxx> wrote:

That jit limit is not there on older kernels and doesn't apply to root.
How would you notice such a kernel bug in such conditions?

I'm talking about bpf_jit_current - it's an "overall gauge" for
allocation, priv and unpriv. I understood Lorenz' note as "change it
so it only tracks unpriv BPF mem usage - since we'll never act on
privileged usage anyway"

Yes, that was my suggestion indeed. What Frank is saying: it looks
like our leak of JIT memory is due to a privileged process. By
exempting privileged processes it would be even harder to notice /
debug. That's true, and brings me back to my question: what is
different about JIT memory that we can't do a better limit?

The knob with the limit was basically added back then as a band-aid to avoid
unprivileged BPF JIT (cBPF or eBPF) eating up all the module memory to the
point where we cannot even load kernel modules anymore. Given that this memory
resource is global, we added the bpf_jit_limit / bpf_jit_current accounting
as a fix/heuristic via ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict
unpriv allocations"). If we didn't account for root, how would such a detection
proposal work to block unprivileged users? I don't think it's feasible to
account only the latter, given privileged progs might have occupied most of the
budget already.
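
To illustrate the accounting argument above, here is a minimal userspace
sketch, not the kernel's code. The struct and function names (jit_pool,
jit_charge, jit_uncharge) are hypothetical; only the idea is taken from this
thread: one global gauge is charged by everyone, and only unprivileged
callers are refused once it exceeds the limit.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model; these names are not kernel API. */
struct jit_pool {
	long limit_pages;   /* stands in for bpf_jit_limit, in pages */
	long current_pages; /* stands in for bpf_jit_current */
};

/*
 * Charge pages to the single global gauge. Privileged and unprivileged
 * callers are both accounted; only unprivileged callers are rejected
 * when the gauge would exceed the limit.
 */
static int jit_charge(struct jit_pool *p, long pages, bool privileged)
{
	p->current_pages += pages;
	if (p->current_pages > p->limit_pages && !privileged) {
		/* Roll back the charge before refusing. */
		p->current_pages -= pages;
		return -EPERM;
	}
	return 0;
}

/* Return pages to the gauge when a JIT image is freed. */
static void jit_uncharge(struct jit_pool *p, long pages)
{
	assert(pages <= p->current_pages);
	p->current_pages -= pages;
}
```

With a limit of 10 pages, a privileged charge of 9 followed by an
unprivileged charge of 2 fails with -EPERM. If the privileged 9 pages were
not counted, the unprivileged request would pass even though the shared
pool is nearly full.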

Thanks, that was the part I was missing. JITed BPF programs are
treated like modules (why?). There is a limited space reserved for
kernel modules.

See bpf_jit_alloc_exec(), which calls module_alloc() for the images' r+x memory
holding the generated opcodes; there's only one such pool for the system. On
x86 in particular, the rationale for using module_alloc() is that the image is
guaranteed to be within +/- 2GB of where the kernel image resides. See the
encoding of BPF_CALL with __bpf_call_base + imm32, for example.
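
As a rough illustration of the +/- 2GB constraint, here is a small userspace
sketch, not kernel code. The helper names (fits_imm32, encode_imm32,
resolve_call) are made up for this example. It assumes only what is said
above: the call target is recovered as a base address plus a signed 32-bit
immediate, so the target has to lie within that range of the base.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if (target - base) is representable as a
 * signed 32-bit immediate, i.e. target is within roughly +/- 2GB of base.
 */
static bool fits_imm32(uint64_t base, uint64_t target)
{
	int64_t delta = (int64_t)(target - base);

	return delta >= INT32_MIN && delta <= INT32_MAX;
}

/* Encode a call target as an offset from base; caller must check range. */
static int32_t encode_imm32(uint64_t base, uint64_t target)
{
	assert(fits_imm32(base, target));
	return (int32_t)(int64_t)(target - base);
}

/* Recover the absolute target from base + sign-extended imm32. */
static uint64_t resolve_call(uint64_t base, int32_t imm)
{
	return base + (uint64_t)(int64_t)imm;
}
```

An address 2GB or more above the base fails the fits_imm32() check, which is
the kind of placement that allocating from the module area is meant to rule
out.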

How does the knob solve the "can't load a new module" problem if our
suggestion / preference is to steer people towards CAP_BPF anyways
(since unpriv BPF is trouble)? Over time all BPF will be privileged
and we're in the same mess again?

Keep in mind that the knob was added before CAP_BPF. In general, unprivileged
cBPF->eBPF is also using the same bpf_jit_alloc_exec() for the JIT, so that
needs to be taken into consideration as well, but if you grant an application
CAP_BPF then you're essentially privileged. The knob's point was to prevent
fully unprivileged users from playing bad games.

Thanks,
Daniel