On Wed, Nov 2, 2022 at 3:30 PM Edgecombe, Rick P <rick.p.edgecombe@xxxxxxxxx> wrote:
>
> On Mon, 2022-10-31 at 15:25 -0700, Song Liu wrote:
> > This set enables bpf programs and bpf dispatchers to share huge pages
> > with a new API:
> >
> >   vmalloc_exec()
> >   vfree_exec()
> >   vcopy_exec()
> >
> > The idea is similar to Peter's suggestion in [1].
> >
> > vmalloc_exec() manages a set of PMD_SIZE RO+X memory, and allocates
> > this memory to its users. vfree_exec() is used to free memory
> > allocated by vmalloc_exec(). vcopy_exec() is used to update memory
> > allocated by vmalloc_exec().
> >
> > Memory allocated by vmalloc_exec() is RO+X, so this does not violate
> > W^X. The caller has to update the content with a text_poke-like
> > mechanism. Specifically, vcopy_exec() is provided to update memory
> > allocated by vmalloc_exec(). vcopy_exec() also makes sure the update
> > stays within the boundary of one chunk allocated by vmalloc_exec().
> > Please refer to patch 1/5 for more details.
> >
> > Patch 3/5 uses these new APIs in bpf program and bpf dispatcher.
> >
> > Patches 4/5 and 5/5 allow static kernel text (_stext to _etext) to
> > share PMD_SIZE pages with dynamic kernel text on x86_64. This is
> > achieved by allocating PMD_SIZE pages up to roundup(_etext, PMD_SIZE),
> > and then using the range from _etext to roundup(_etext, PMD_SIZE) for
> > dynamic kernel text.
>
> It might help to spell out what the benefits of this are. My
> understanding is that (to my surprise) we actually haven't seen a
> performance improvement from using 2MB pages for JITs. The main
> performance benefit you saw in your previous version was from reduced
> fragmentation of the direct map, IIUC. This comes from reusing the same
> pages for JITs, so that new ones don't need to be broken.
>
> The other benefit of this thing is reduced shootdowns. It can load a
> JIT with only a local TLB flush on average, which should help systems
> with very high CPU counts by some unknown amount.
Thanks for pointing out the missing information. I don't have a
benchmark that uses very big BPF programs, so the results I have don't
show much benefit from fewer iTLB misses.

Song