On Mon, Oct 31, 2022 at 03:25:37PM -0700, Song Liu wrote: > vmalloc_exec is used to allocate memory to host dynamic kernel text > (modules, BPF programs, etc.) with huge pages. This is similar to the > proposal by Peter in [1]. This is allg reat but we need to clarify *why* we would go through the trouble. So if folks are not to excited about this series, that's probably why. IMHO it lacks substance for rationale, **and** implies a few gains without any *clear* performance metrics. I have 0 experience with mm so I'd like other's feedback on my this -- I'm just trying to do decipher rationale from prior "bpf prog pack" efforts. I'm sensing that the cables in messaging are a bit crossed here and we need to provide a bit better full picture for rationale and this is being completely missed and this work is being undersold. If my assessment is accurate though, the bpf prog pack strategy with sharing huge pages may prove useful long term for other things than just modules / ftrace / kprobes. I was surprised to see this entire patch series upgrade from RFC to proper PATCH form now completely fails to mention any of the original motivations behind the "BPF prog pack", which you are doing a true heroic effort to try to generalize as the problem is hard. Let me try to help with that. The rationale for the old BPF prog pack is documented as follows: * Most BPF programs are pretty small. Allocating a hole page for each * program is sometime a waste. Many small bpf program also adds pressure * to instruction TLB. To solve this issue, we introduce a BPF program pack * allocator. The prog_pack allocator uses HPAGE_PMD_SIZE page (2MB on x86) * to host BPF programs. Previously you have also stated in earlier versions of this patch set: "Most BPF programs are small, but they consume a page each. For systems with busy traffic and many BPF programs, this could also add significant pressure to instruction TLB. High iTLB pressure usually causes slow down for the whole system, which includes visible performance degradation for production workloads." So it is implied here that one of the benefits is to help reduce iTLB misses. But that's it. We have no visible numbers to look at and for what... But reducing iTLB misses doesn't always have a complete direct correlation with improving things, but if the code change is small enough it obviously makes sense to apply. If the change is a bit more intrusive, as in this patch series a bit more rationale should be provided. Other than the "performance aspects" of your patchset, the *main* reason I am engaged and like it is it reduces the nasty mess of semantics on dealing with special permissions on pages which we see in modules and a few other places which today completely open code it. That proves error prone and I'm glad to see efforts to generalize that nastiness. So please ensure this is added as part of the documented rationale. Even if the iTLB miss ratio improvement is not astronomical I believe that the gains in sanity on improving semantics on special pages and sharing code make it well worthwhile. The iTLB miss ratio improvement is just a small cherry on top. Going back to performance aspects, when Linus had poked for more details about this your have elaborated further: "we have seen direct map fragmentation causing visible performance drop for our major services. This is the shadow production benchmark, so it is not possible to run it out of our data centers. Tracing showed that BPF program was the top trigger of these direct map splits." And the only other metric we have is: "For our web service production benchmark, bpf_prog_pack on 4kB pages gives 0.5% to 0.7% more throughput than not using bpf_prog_pack." These metrics are completely arbitrary and opaque to us. We need something tangible and reproducible and I have been suggesting that from early on... I'm under the impression that the real missed, undocumented, major value-add here is that the old "BPF prog pack" strategy helps to reduce the direct map fragmentation caused by heavy use of the eBPF JIT programs and this in turn helps your overall random system performance (regardless of what it is you do). As I see it then the eBPF prog pack is just one strategy to try to mitigate memory fragmentation on the direct map caused by the the eBPF JIT programs, so the "slow down" your team has obvserved should be due to the eventual fragmentation caused on the direct map *while* eBPF programs get heavily used. Mike Rapoport had presented about the Direct map fragmentation problem at Plumbers 2021 [0], and clearly mentioned modules / BPF / ftrace / kprobes as possible sources for this. Then Xing Zhengjun's 2021 performance evaluation on whether using 2M/1G pages aggressively for the kernel direct map help performance [1] ends up generally recommending huge pages. The work by Xing though was about using huge pages *alone*, not using a strategy such as in the "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF programs, and that I think is the real golden nugget here. I contend therefore that the theoretical reduction of iTLB misses by using huge pages for "bpf prog pack" is not what gets your systems to perform somehow better. It should be simply that it reduces fragmentation and *this* generally can help with performance long term. If this is accurate then let's please separate the two aspects to this. There's two aspects to what I would like to see from a performance perspective then actually mentioned in the commit logs: 1) iTLB miss loss ratio with "bpf prog pack" or this generalized solution Vs not using it at all: I'd tried the below but didn't see much difference: perf stat --repeat 10 -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-stores-misses,iTLB-loads,iTLB-load-misses,page-faults \ --pre 'modprobe -r test_bpf' \-- modprobe test_bpf This is likely that the system I have might have a huge cache and I was not contending it with another heavy memory intensive workload. So it may be worthy to run something like the above *as* some other memory intensive benchark kicks gear in a loop. Another reason too may be that test_bpf runs tests sequentially instead of in parallel, and so it would be good to get ebpf folks's feedback as to an idea of what other things could be done to really eBPF JIT the hell out of a system. It is why I had suggested long ago maybe a custom new selftest which stresses the hell out of only eBPF JIT and had given an example multithreaded selftest. If we ever get modules support, I was hoping to see if something like the below (if the fs module .text section does end up shared) *may* have reduce iTLB miss ratio. perf stat --repeat 10 -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-stores-misses,iTLB-loads,iTLB-load-misses,page-faults \ --pre 'make -s mrproper defconfig' \-- make -s -j$(nproc) bzImage Maybe something as simple as systemd-analyze may show time reduction if iTLB miss ratio is reduced by sharing most module code in a huge page. 2) Estimate in reduction on direct map fragmentation by using the "bpf prog pack" or this generalized solution: For this I'd expect a benchmark similar to the workload you guys run or something memory intensive, as eBPF JITs are heavily used, and after a certain amount of time somehow compute how fragmented memory is. The only sensible thing I can think to measure memory fragmentation is to look at the memory compaction index /sys/kernel/debug/extfrag/extfrag_index , but I highly welcome other's ideas as I'm a mm n00b. [0] https://lpc.events/event/11/contributions/1127/attachments/922/1792/LPC21%20Direct%20map%20management%20.pdf [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@xxxxxxxxxxxxxxx/ > +static void move_vmap_to_free_text_tree(void *addr) > +{ > + struct vmap_area *va; > + > + /* remove from vmap_area_root */ > + spin_lock(&vmap_area_lock); > + va = __find_vmap_area((unsigned long)addr, &vmap_area_root); > + if (WARN_ON_ONCE(!va)) { This seems to be hopeful but we purposely are allowing for the allocated memory to be used elsewhere are we not? Can you trigger the WARN_ON_ONCE() with stress-ng as described in commit 82dd23e84be3e ("mm/vmalloc.c: preload a CPU with one object for split purpose") while using eBPF JIT or loading/unloading a module in a loop? > +void *vmalloc_exec(unsigned long size, unsigned long align) > +{ > + struct vmap_area *va, *tmp; > + unsigned long addr; > + enum fit_type type; > + int ret; > + > + va = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, NUMA_NO_NODE); > + if (unlikely(!va)) > + return NULL; > + > +again: > + preload_this_cpu_lock(&free_text_area_lock, GFP_KERNEL, NUMA_NO_NODE); > + tmp = find_vmap_lowest_match(&free_text_area_root, size, align, 1, false); > + > + if (!tmp) { > + unsigned long alloc_size; > + void *ptr; > + > + spin_unlock(&free_text_area_lock); > + > + /* > + * Not enough continuous space in free_text_area_root, try > + * allocate more memory. The memory is first added to > + * vmap_area_root, and then moved to free_text_area_root. > + */ > + alloc_size = roundup(size, PMD_SIZE * num_online_nodes()); > + ptr = __vmalloc_node_range(alloc_size, PMD_SIZE, VMALLOC_EXEC_START, > + VMALLOC_EXEC_END, GFP_KERNEL, PAGE_KERNEL, > + VM_ALLOW_HUGE_VMAP | VM_NO_GUARD, > + NUMA_NO_NODE, __builtin_return_address(0)); I thought Peter had suggested keeping the guard? Once you get modules going you may want to update the comment about modules on __vmalloc_node_range(). Luis