Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

"Edgecombe, Rick P" <rick.p.edgecombe@xxxxxxxxx> · Tue, 19 Apr 2022 01:56:03 +0000

On Mon, 2022-04-18 at 17:44 -0700, Luis Chamberlain wrote:
> > There are use-cases that require 4K pages with non-default
> > permissions in
> > the direct map and the pages not necessarily should be executable.
> > There
> > were several suggestions to implement caches of 4K pages backed by
> > 2M
> > pages.
> 
> Even if we just focus on the executable side of the story... there
> may
> be users who can share this too.
> 
> I've gone down memory lane now at least down to year 2005 in kprobes
> to see why the heck module_alloc() was used. At first glance there
> are
> some old comments about being within the 2 GiB text kernel range...
> But
> some old tribal knowledge is still lost. The real hints come from
> kprobe work
> since commit 9ec4b1f356b3 ("[PATCH] kprobes: fix single-step out of
> line
> - take2"), so that the "For the %rip-relative displacement fixups to
> be
> doable"... but this got me wondering, would other users who *do* want
> similar funcionality benefit from a cache. If the space is limited
> then
> using a cache makes sense. Specially if architectures tend to require
> hacks for some of this to all work.

Yea, that was my understanding. X86 modules have to be linked within
2GB of the kernel text, also eBPF x86 JIT generates code that expects
to be within 2GB of the kernel text.

I think of two types of caches we could have: caches of unmapped pages
on the direct map and caches of virtual memory mappings. Caches of
pages on the direct map reduce breakage of the large pages (and is
somewhat x86 specific problem). Caches of virtual memory mappings
reduce shootdowns, and are also required to share huge pages. I'll plug
my old RFC, where I tried to work towards enabling both:

https://lore.kernel.org/lkml/20201120202426.18009-1-rick.p.edgecombe@xxxxxxxxx/

Since then Mike has taken a lot further the direct map cache piece.

Yea, probably a lot of JIT's are way smaller than a page, but there is
also hopefully some performance benefit of reduced ITLB pressure and
TLB shootdowns. I think kprobes/ftrace (or at least one of them) keeps
its own cache of a page for putting very small trampolines.

> 
> Then, since it seems since the vmalloc area was not initialized,
> wouldn't that break the old JIT spray fixes, refer to commit
> 314beb9bcabfd ("x86: bpf_jit_comp: secure bpf jit against spraying
> attacks")?

Hmm, yea it might be a way to get around the ebpf jit rlimit. The
allocator could just text_poke() invalid instructions on "free" of the
jit.

> 
> Is that sort of work not needed anymore? If in doubt I at least made
> the
> old proof of concept JIT spray stuff compile on recent kernels [0],
> but
> I haven't tried out your patches yet. If this is not needed anymore,
> why not?

IIRC this got addressed in two ways, randomizing of the jit offset
inside the vmalloc allocation, and "constant blinding", such that the
specific attack of inserting unaligned instructions as immediate
instruction data did not work. Neither of those mitigations seem
unworkable with a large page caching allocator.

> 
> The collection of tribal knowedge around these sorts of things would
> be
> good to not loose and if we can share, even better.

Totally agree here. I think the abstraction I was exploring in that RFC
could remove some of the special permission memory tribal knowledge
that is lurking in in the cross-arch module.c. I wonder if you have any
thoughts on something like that? The normal modules proved the hardest.