On Wed, Nov 2, 2022 at 3:30 PM Edgecombe, Rick P <rick.p.edgecombe@xxxxxxxxx> wrote:
>
> On Mon, 2022-10-31 at 15:25 -0700, Song Liu wrote:
> > This set enables bpf programs and bpf dispatchers to share huge pages
> > with a new API:
> >
> >   vmalloc_exec()
> >   vfree_exec()
> >   vcopy_exec()
> >
> > The idea is similar to Peter's suggestion in [1].
> >
> > vmalloc_exec() manages a set of PMD_SIZE RO+X memory, and allocates
> > this memory to its users. vfree_exec() is used to free memory
> > allocated by vmalloc_exec(). vcopy_exec() is used to update memory
> > allocated by vmalloc_exec().
> >
> > Memory allocated by vmalloc_exec() is RO+X, so this does not violate
> > W^X. The caller has to update the content with a text_poke-like
> > mechanism. Specifically, vcopy_exec() is provided to update memory
> > allocated by vmalloc_exec(). vcopy_exec() also makes sure the update
> > stays within the boundary of one chunk allocated by vmalloc_exec().
> > Please refer to patch 1/5 for more details.
> >
> > Patch 3/5 uses these new APIs in bpf program and bpf dispatcher.
> >
> > Patches 4/5 and 5/5 allow static kernel text (_stext to _etext) to
> > share PMD_SIZE pages with dynamic kernel text on x86_64. This is
> > achieved by allocating PMD_SIZE pages up to roundup(_etext, PMD_SIZE),
> > and then using the range from _etext to roundup(_etext, PMD_SIZE) for
> > dynamic kernel text.
>
> It might help to spell out what the benefits of this are. My
> understanding is that (to my surprise) we actually haven't seen a
> performance improvement from using 2MB pages for JITs. The main
> performance benefit you saw in your previous version was from reduced
> fragmentation of the direct map, IIUC. This comes from reusing the same
> pages for JITs, so that new ones don't need to be broken.
>
> The other benefit of this thing is reduced shootdowns. It can load a
> JIT with only a local TLB flush on average, which should help systems
> with very high CPU counts by some unknown amount.
Thanks for pointing out the missing information. I don't have a
benchmark that uses very big BPF programs, so the results I have don't
show much benefit from fewer iTLB misses.

Song