On Thu, Nov 03, 2022 at 05:51:57PM +0200, Mike Rapoport wrote:
> Hi Luis,
>
> Thanks for looping me in.
>
> On Wed, Nov 02, 2022 at 04:41:59PM -0700, Luis Chamberlain wrote:
> > On Mon, Oct 31, 2022 at 03:25:37PM -0700, Song Liu wrote:
> > > vmalloc_exec is used to allocate memory to host dynamic kernel text
> > > (modules, BPF programs, etc.) with huge pages. This is similar to the
> > > proposal by Peter in [1].
> >
> > This is all great, but we need to clarify *why* we would go through the
> > trouble. If folks are not too excited about this series, that's probably
> > why. IMHO it lacks substance for the rationale, **and** implies a few
> > gains without any *clear* performance metrics. I have 0 experience with
> > mm so I'd like others' feedback on this -- I'm just trying to decipher
> > the rationale from the prior "bpf prog pack" efforts.
> >
> > I'm sensing that the cables in messaging are a bit crossed here: we need
> > to provide a better full picture of the rationale, that picture is being
> > completely missed, and this work is being undersold. If my assessment is
> > accurate, the bpf prog pack strategy of sharing huge pages may prove
> > useful long term for other things than just modules / ftrace / kprobes.
> >
> > I was surprised to see that this entire patch series, upgraded from RFC
> > to proper PATCH form, now completely fails to mention any of the
> > original motivations behind the "BPF prog pack", which you are making a
> > true heroic effort to generalize, as the problem is hard. Let me try to
> > help with that. The rationale for the old BPF prog pack is documented as
> > follows:
> >
> >  * Most BPF programs are pretty small. Allocating a whole page for each
> >  * program is sometimes a waste. Many small bpf programs also add
> >  * pressure to the instruction TLB. To solve this issue, we introduce a
> >  * BPF program pack allocator. The prog_pack allocator uses
> >  * HPAGE_PMD_SIZE pages (2MB on x86) to host BPF programs.
> >
> > Previously you have also stated in earlier versions of this patch set:
> >
> > "Most BPF programs are small, but they consume a page each. For systems
> > with busy traffic and many BPF programs, this could also add significant
> > pressure to the instruction TLB. High iTLB pressure usually causes a
> > slowdown for the whole system, which includes visible performance
> > degradation for production workloads."
> >
> > So it is implied here that one of the benefits is to help reduce iTLB
> > misses. But that's it. We have no visible numbers to look at, and for
> > what... Reducing iTLB misses doesn't always correlate directly with
> > improving things, but if the code change is small enough it obviously
> > makes sense to apply. If the change is a bit more intrusive, as in this
> > patch series, a bit more rationale should be provided.
> >
> > Other than the "performance aspects" of your patchset, the *main* reason
> > I am engaged and like it is that it reduces the nasty mess of semantics
> > around dealing with special permissions on pages, which we see in
> > modules and a few other places that today completely open code it. That
> > proves error prone and I'm glad to see efforts to generalize that
> > nastiness. So please ensure this is added as part of the documented
> > rationale. Even if the iTLB miss ratio improvement is not astronomical,
> > I believe that the gains in sanity from improving semantics on special
> > pages and sharing code make it well worthwhile. The iTLB miss ratio
> > improvement is just a small cherry on top.
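
To make that documented rationale a bit more concrete for anyone who has
not read the bpf core code: the heart of the prog pack strategy is to
carve many small allocations out of one shared 2 MiB mapping instead of
handing each program its own page. A rough sketch of just that
bookkeeping, with made-up names and without the permission handling and
text_poke() copying the real allocator needs, would be something like:

  #include <linux/bitmap.h>
  #include <linux/kernel.h>
  #include <linux/list.h>
  #include <linux/sizes.h>

  #define PACK_SIZE      SZ_2M                    /* HPAGE_PMD_SIZE on x86 */
  #define PACK_CHUNK     64                       /* allocation granularity */
  #define PACK_NR_CHUNKS (PACK_SIZE / PACK_CHUNK)

  struct text_pack {
          struct list_head list;
          void *base;              /* one shared 2 MiB executable mapping */
          DECLARE_BITMAP(chunks, PACK_NR_CHUNKS);
  };

  static void *text_pack_alloc(struct text_pack *pack, size_t size)
  {
          unsigned int nr = DIV_ROUND_UP(size, PACK_CHUNK);
          unsigned long pos;

          /* find a free run of chunks inside the shared huge page */
          pos = bitmap_find_next_zero_area(pack->chunks, PACK_NR_CHUNKS,
                                           0, nr, 0);
          if (pos >= PACK_NR_CHUNKS)
                  return NULL;     /* caller grabs a fresh pack */

          bitmap_set(pack->chunks, pos, nr);
          return pack->base + pos * PACK_CHUNK;
  }

  static void text_pack_free(struct text_pack *pack, void *addr, size_t size)
  {
          unsigned long pos = (addr - pack->base) / PACK_CHUNK;

          bitmap_clear(pack->chunks, pos, DIV_ROUND_UP(size, PACK_CHUNK));
  }

The detail that matters for the rest of this discussion is that all the
small programs end up sharing a single PMD mapping, so the number of
executable mappings, and the direct map splits that setting permissions on
them triggers, does not grow with the number of programs. At least that is
my reading of it.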
> >
> > Going back to the performance aspects: when Linus poked for more
> > details about this, you elaborated further:
> >
> > "we have seen direct map fragmentation causing visible performance drop
> > for our major services. This is the shadow production benchmark, so it
> > is not possible to run it out of our data centers. Tracing showed that
> > BPF program was the top trigger of these direct map splits."
> >
> > And the only other metric we have is:
> >
> > "For our web service production benchmark, bpf_prog_pack on 4kB pages
> > gives 0.5% to 0.7% more throughput than not using bpf_prog_pack."
> >
> > These metrics are completely arbitrary and opaque to us. We need
> > something tangible and reproducible, and I have been suggesting that
> > from early on...
> >
> > I'm under the impression that the real missed, undocumented, major
> > value-add here is that the old "BPF prog pack" strategy helps reduce
> > the direct map fragmentation caused by heavy use of eBPF JIT programs,
> > and this in turn helps your overall system performance (regardless of
> > what it is you do). As I see it, the eBPF prog pack is just one
> > strategy to try to mitigate memory fragmentation of the direct map
> > caused by the eBPF JIT programs, so the "slow down" your team has
> > observed should be due to the eventual fragmentation caused on the
> > direct map *while* eBPF programs get heavily used.
>
> I believe that while the eBPF prog pack is helpful in mitigating the
> direct map fragmentation caused by the eBPF JIT programs, the same
> strategy of allocating a large page, splitting its PMD entry and then
> reusing the memory for smaller allocations can be (and should be)
> generalized to other use cases that require non-default permissions in
> the page table. The most prominent use cases are those that allocate
> memory for code, but the same approach is relevant for other cases, like
> secretmem or page table protection with PKS.
>
> A while ago I suggested handling such caching of large pages at the page
> allocator level, but when we discussed it at LSF/MM/BPF, the prevailing
> opinion was that the added value does not justify changes to the page
> allocator, and it was suggested to handle such caching elsewhere.

I saw that in the LWN coverage.

> I had to put this project on the back burner for $VARIOUS_REASONS, but I
> still think that we need a generic allocator for memory with non-default
> permissions in the direct map, and that code allocation should build on
> that allocator.

It seems this generalization of the bpf prog pack, so that it can possibly
be used for modules / kprobes / ftrace, is a small step in that direction.

> All that said, the direct map fragmentation problem is currently
> relevant only to x86 because it's the only architecture that supports
> splitting of the large pages in the direct map.

I was thinking even more long term too, using this as a proof of concept.
If this practice in general helps with fragmentation, could it be used for
experimentation with compound pages later, as a way to reduce possible
fragmentation?

> > Mike Rapoport presented on the direct map fragmentation problem at
> > Plumbers 2021 [0], and clearly mentioned modules / BPF / ftrace /
> > kprobes as possible sources of it. Then Xing Zhengjun's 2021
> > performance evaluation of whether using 2M/1G pages aggressively for
> > the kernel direct map helps performance [1] ended up generally
> > recommending huge pages.
> > The work by Xing, though, was about using huge pages *alone*, not
> > about using a strategy such as the "bpf prog pack" to share one 2 MiB
> > huge page for *all* small eBPF programs, and that I think is the real
> > golden nugget here.
> >
> > I contend therefore that the theoretical reduction of iTLB misses by
> > using huge pages for the "bpf prog pack" is not what gets your systems
> > to perform somehow better. It should simply be that it reduces
> > fragmentation, and *this* generally can help with performance long
> > term. If this is accurate then let's please separate these two aspects.
>
> The direct map fragmentation is the reason for the higher TLB miss rate,
> both for the iTLB and the dTLB.

OK, so then whatever benchmark is running in tandem while the eBPF JIT is
hammered should *also* be measured with perf for iTLB and dTLB misses,
i.e. the patches can provide such results as justification.

> Whenever a large page in the direct map is split, all kernel accesses
> via the direct map will use small pages, which requires dealing with 512
> page table entries instead of one for a 2M range.
>
> Since small pages in the direct map are never collapsed back to large
> pages, a long-living system that heavily uses eBPF programs will have
> its direct map severely fragmented, a higher TLB miss rate and worse
> overall performance.

Shouldn't compaction help with those situations?

> > There are two aspects to what I would like to see from a performance
> > perspective actually mentioned in the commit logs:
> >
> > 1) The iTLB miss ratio with the "bpf prog pack" or this generalized
> >    solution vs not using it at all:
> >
> > ...
> >
> > 2) An estimate of the reduction in direct map fragmentation from using
> >    the "bpf prog pack" or this generalized solution:
> >
> > For this I'd expect a benchmark similar to the workload you guys run,
> > or something memory intensive, while eBPF JITs are heavily used, and
> > after a certain amount of time somehow compute how fragmented memory
> > is. The only sensible thing I can think of to measure memory
> > fragmentation is to look at the memory compaction index
> > /sys/kernel/debug/extfrag/extfrag_index, but I highly welcome others'
> > ideas as I'm a mm n00b.

> The direct map fragmentation can be tracked with
>
> grep DirectMap /proc/meminfo
> grep direct_map /proc/vmstat
>
> and by looking at /sys/kernel/debug/page_tables/kernel

Thanks!

Luis
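
PS: for the iTLB / dTLB angle above, I *think* (untested suggestion, so
adjust to whatever your benchmark harness looks like) something as simple
as running the workload under

  perf stat -a -e iTLB-load-misses,dTLB-load-misses,dTLB-store-misses -- <your benchmark>

with and without this series, together with the DirectMap counters from
/proc/meminfo before and after the run, would already give reviewers
something tangible and reproducible to look at.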