Re: [PATCH bpf-next v1 RESEND 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec

Hi Luis,

Thanks for looping me in.

On Wed, Nov 02, 2022 at 04:41:59PM -0700, Luis Chamberlain wrote:
> On Mon, Oct 31, 2022 at 03:25:37PM -0700, Song Liu wrote:
> > vmalloc_exec is used to allocate memory to host dynamic kernel text
> > (modules, BPF programs, etc.) with huge pages. This is similar to the
> > proposal by Peter in [1].
> 
> This is all great, but we need to clarify *why* we would go through the
> trouble.  So if folks are not too excited about this series, that's
> probably why. IMHO it lacks substance for rationale, **and** implies a few
> gains without any *clear* performance metrics. I have 0 experience with
> mm so I'd like others' feedback on this -- I'm just trying to
> decipher the rationale from the prior "bpf prog pack" efforts.
> 
> I'm sensing that the wires are a bit crossed in the messaging here: we
> need to give a better full picture of the rationale, that picture is
> currently missing, and so this work is being undersold.  If my assessment
> is accurate though, the bpf prog pack strategy of sharing huge pages may
> prove useful long term for other things than just modules / ftrace / kprobes.
> 
> I was surprised to see that this entire patch series, now upgraded from RFC
> to proper PATCH form, completely fails to mention any of the original
> motivations behind the "BPF prog pack", which you are making a truly heroic
> effort to generalize, as the problem is hard. Let me try to help with that.
> The rationale for the old BPF prog pack is documented as follows:
> 
> * Most BPF programs are pretty small. Allocating a whole page for each
> * program is sometimes a waste. Many small bpf programs also add pressure
> * to the instruction TLB. To solve this issue, we introduce a BPF program
> * pack allocator. The prog_pack allocator uses HPAGE_PMD_SIZE pages (2MB on
> * x86) to host BPF programs.
> 
> You have also stated in earlier versions of this patch set:
> 
>   "Most BPF programs are small, but they consume a page each. For systems
>    with busy traffic and many BPF programs, this could also add significant
>    pressure to instruction TLB. High iTLB pressure usually causes slow down
>    for the whole system, which includes visible performance
>    degradation for production workloads."
> 
> So it is implied here that one of the benefits is to help reduce iTLB misses.
> But that's it: we have no visible numbers to look at, and no indication of
> what exactly was measured. Reducing iTLB misses doesn't always correlate
> directly with improving things; if the code change is small enough it
> obviously makes sense to apply anyway, but if the change is a bit more
> intrusive, as in this patch series, a bit more rationale should be provided.
> 
> Other than the "performance aspects" of your patchset, the *main* reason
> I am engaged and like it is that it reduces the nasty mess of semantics
> around dealing with special permissions on pages, which we see in modules
> and a few other places that today completely open code it. That has proven
> error prone and I'm glad to see efforts to generalize that nastiness. So
> please ensure this is added as part of the documented rationale. Even
> if the iTLB miss ratio improvement is not astronomical, I believe that
> the gains in sanity from improving the semantics of special pages and
> sharing code make it well worthwhile. The iTLB miss ratio improvement is
> just a small cherry on top.
> 
> Going back to the performance aspects, when Linus poked for more details
> about this you elaborated further:
> 
>   "we have seen direct map fragmentation causing visible
>    performance drop for our major services. This is the shadow 
>    production benchmark, so it is not possible to run it out of 
>    our data centers. Tracing showed that BPF program was the top 
>    trigger of these direct map splits."
> 
> And the only other metric we have is:
> 
>   "For our web service production benchmark, bpf_prog_pack on 4kB pages
>    gives 0.5% to 0.7% more throughput than not using bpf_prog_pack."
> 
> These metrics are completely arbitrary and opaque to us. We need
> something tangible and reproducible and I have been suggesting that
> from early on...
> 
> I'm under the impression that the real missed, undocumented, major value-add
> here is that the old "BPF prog pack" strategy helps to reduce the direct map
> fragmentation caused by heavy use of eBPF JIT programs, and this in turn
> helps your overall system performance (regardless of what it is you do).
> As I see it, the eBPF prog pack is just one strategy to try to mitigate
> memory fragmentation of the direct map caused by the eBPF JIT programs, so
> the "slow down" your team has observed should be due to the eventual
> fragmentation of the direct map *while* eBPF programs get heavily used.

I believe that while the eBPF prog pack is helpful in mitigating the
direct map fragmentation caused by the eBPF JIT programs, the same strategy
of allocating a large page, splitting its PMD entry and then reusing the
memory for smaller allocations can be (and should be) generalized to other
use cases that require non-default permissions in the page table.  The most
prominent use cases are those that allocate memory for code, but the same
approach is relevant for other cases, like secretmem or page table
protection with PKS.
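
To make the pattern concrete, here is a rough userspace sketch of the idea
(purely illustrative, not the kernel code; pack_alloc(), PACK_SIZE and
CHUNK_SIZE are made-up names): reserve one large region up front, serve many
small allocations from it, and manage the permissions of the whole region as
a unit instead of per small allocation:

#include <stdio.h>
#include <sys/mman.h>

#define PACK_SIZE  (2UL << 20)  /* one PMD-sized (2 MiB) region */
#define CHUNK_SIZE 64UL         /* illustrative sub-allocation granularity */

static unsigned char *pack;
static unsigned long pack_used;

/* Carve a small chunk out of the shared region, creating it on first use. */
static void *pack_alloc(unsigned long size)
{
        void *p;

        if (!pack) {
                pack = mmap(NULL, PACK_SIZE, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (pack == MAP_FAILED) {
                        pack = NULL;
                        return NULL;
                }
        }

        size = (size + CHUNK_SIZE - 1) & ~(CHUNK_SIZE - 1);
        if (pack_used + size > PACK_SIZE)
                return NULL;

        p = pack + pack_used;
        pack_used += size;
        return p;
}

int main(void)
{
        /* Many small "programs" end up sharing one large mapping... */
        void *a = pack_alloc(100);
        void *b = pack_alloc(300);

        printf("pack=%p a=%p b=%p used=%lu\n", (void *)pack, a, b, pack_used);

        /* ...and the permission change is done once for the whole region. */
        if (mprotect(pack, PACK_SIZE, PROT_READ | PROT_EXEC))
                perror("mprotect");

        return 0;
}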

A while ago I suggested handling such caching of large pages at the
page allocator level, but when we discussed it at LSF/MM/BPF, the prevailing
opinion was that the added value does not justify changes to the page
allocator, and it was suggested to handle such caching elsewhere.

I had to put this project on the back burner for $VARIOUS_REASONS, but I
still think that we need a generic allocator for memory with non-default
permissions in the direct map, and that code allocation should build on that
allocator.

All that said, the direct map fragmentation problem is currently relevant
only to x86 because it's the only architecture that supports splitting of
the large pages in the direct map.
 
> Mike Rapoport presented the direct map fragmentation problem at Plumbers
> 2021 [0], and clearly mentioned modules / BPF / ftrace / kprobes as possible
> sources of it. Then Xing Zhengjun's 2021 performance evaluation of whether
> using 2M/1G pages aggressively for the kernel direct map helps performance
> [1] ends up generally recommending huge pages. Xing's work, though, was
> about using huge pages *alone*, not about a strategy such as the "bpf prog
> pack" one of sharing one 2 MiB huge page among *all* small eBPF programs,
> and that I think is the real golden nugget here.
> 
> I contend therefore that the theoretical reduction of iTLB misses by using
> huge pages for "bpf prog pack" is not what makes your systems perform
> better. It should simply be that it reduces fragmentation, and *this*
> generally can help with performance long term. If this is accurate,
> then let's please separate these two aspects.

The direct map fragmentation is the reason for a higher TLB miss rate, both
for the iTLB and the dTLB. Whenever a large page in the direct map is split,
all kernel accesses via the direct map to that range will use small pages,
which requires dealing with 512 page table entries instead of one for a 2M
range.

Since small pages in the direct map are never collapsed back into large
pages, a long-living system that heavily uses eBPF programs will end up with
a severely fragmented direct map, a higher TLB miss rate and worse overall
performance.
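
To put rough numbers on that, here is a trivial back-of-the-envelope
calculation (plain userspace C, assuming the usual x86-64 4K base pages and
2M PMD-mapped pages):

#include <stdio.h>

int main(void)
{
        const unsigned long base_page = 4UL << 10;   /* 4 KiB PTE-mapped page */
        const unsigned long pmd_page  = 2UL << 20;   /* 2 MiB PMD-mapped page */
        const unsigned long one_gig   = 1UL << 30;

        /* One split 2M entry becomes 512 small entries for the same range. */
        printf("entries per 2M range: %lu (split) vs 1 (intact)\n",
               pmd_page / base_page);

        /* Mappings needed to cover 1 GiB of direct map. */
        printf("entries per 1G of direct map: %lu (4K) vs %lu (2M)\n",
               one_gig / base_page, one_gig / pmd_page);

        return 0;
}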

> There are two aspects I would like to see, from a performance
> perspective, actually mentioned in the commit logs:
> 
> 1) The iTLB miss ratio with "bpf prog pack" or this generalized solution
>    vs not using it at all:

... 
 
> 2) An estimate of the reduction in direct map fragmentation from using the
>    "bpf prog pack" or this generalized solution:
> 
>    For this I'd expect a benchmark similar to the workload you guys
>    run, or something memory intensive in which eBPF JITs are heavily used,
>    and then after a certain amount of time somehow compute how fragmented
>    memory is. The only sensible thing I can think of to measure memory
>    fragmentation is to look at the memory compaction index
>    /sys/kernel/debug/extfrag/extfrag_index , but I highly welcome others'
>    ideas as I'm an mm n00b.

The direct map fragmentation can be tracked with 

	grep DirectMap /proc/meminfo
	grep direct_map /proc/vmstat

and by looking at /sys/kernel/debug/page_tables/kernel
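
For completeness, the same information can be pulled out programmatically;
a minimal sketch (the DirectMap* fields in /proc/meminfo and the direct_map_*
split counters in /proc/vmstat are x86-specific and only present on kernels
that expose them):

#include <stdio.h>
#include <string.h>

/* Print the lines of @path that start with @prefix. */
static void dump_matching(const char *path, const char *prefix)
{
        char line[256];
        FILE *f = fopen(path, "r");

        if (!f) {
                perror(path);
                return;
        }
        while (fgets(line, sizeof(line), f))
                if (!strncmp(line, prefix, strlen(prefix)))
                        printf("%s: %s", path, line);
        fclose(f);
}

int main(void)
{
        dump_matching("/proc/meminfo", "DirectMap");   /* DirectMap4k/2M/1G sizes */
        dump_matching("/proc/vmstat", "direct_map");   /* large page split counters */
        return 0;
}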
 
> [0] https://lpc.events/event/11/contributions/1127/attachments/922/1792/LPC21%20Direct%20map%20management%20.pdf
> [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@xxxxxxxxxxxxxxx/
> 
>   Luis

-- 
Sincerely yours,
Mike.



