On Mon, 2022-10-31 at 15:25 -0700, Song Liu wrote:
> This set enables bpf programs and bpf dispatchers to share huge pages
> with new API:
>   vmalloc_exec()
>   vfree_exec()
>   vcopy_exec()
>
> The idea is similar to Peter's suggestion in [1].
>
> vmalloc_exec() manages a set of PMD_SIZE RO+X memory, and allocates
> this memory to its users. vfree_exec() is used to free memory
> allocated by vmalloc_exec(). vcopy_exec() is used to update memory
> allocated by vmalloc_exec().
>
> Memory allocated by vmalloc_exec() is RO+X, so this does not violate
> W^X. The caller has to update the content with a text_poke-like
> mechanism. Specifically, vcopy_exec() is provided to update memory
> allocated by vmalloc_exec(). vcopy_exec() also makes sure the update
> stays within the boundary of one chunk allocated by vmalloc_exec().
> Please refer to patch 1/5 for more details.
>
> Patch 3/5 uses these new APIs in bpf programs and bpf dispatchers.
>
> Patches 4/5 and 5/5 allow static kernel text (_stext to _etext) to
> share PMD_SIZE pages with dynamic kernel text on x86_64. This is
> achieved by allocating PMD_SIZE pages up to roundup(_etext, PMD_SIZE),
> and then using the range from _etext to roundup(_etext, PMD_SIZE) for
> dynamic kernel text.

It might help to spell out what the benefits of this are. My
understanding is that (to my surprise) we actually haven't seen a
performance improvement from using 2MB pages for JITs. The main
performance benefit you saw on your previous version was from reduced
fragmentation of the direct map, IIUC. This came from reusing the same
pages for JITs, so that new direct map pages don't need to be broken
up.

The other benefit of this thing is reduced TLB shootdowns. It can load
a JIT with, on average, only a local TLB flush, which should help
systems with very high CPU counts by some unknown amount.