Re: [PATCH bpf-next v1 RESEND 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec

"Edgecombe, Rick P" <rick.p.edgecombe@xxxxxxxxx> · Thu, 3 Nov 2022 21:19:25 +0000

On Thu, 2022-11-03 at 11:59 -0700, Luis Chamberlain wrote:
> > > Mike Rapoport had presented about the Direct map fragmentation
> > > problem at Plumbers 2021 [0], and clearly mentioned modules / BPF /
> > > ftrace / kprobes as possible sources for this. Then Xing Zhengjun's
> > > 2021 performance evaluation on whether using 2M/1G pages
> > > aggressively for the kernel direct map help performance [1] ends up
> > > generally recommending huge pages. The work by Xing though was about
> > > using huge pages *alone*, not using a strategy such as in the "bpf
> > > prog pack" to share one 2 MiB huge page for *all* small eBPF
> > > programs, and that I think is the real golden nugget here.
> > >
> > > I contend therefore that the theoretical reduction of iTLB misses by
> > > using huge pages for "bpf prog pack" is not what gets your systems
> > > to perform somehow better. It should be simply that it reduces
> > > fragmentation and *this* generally can help with performance long
> > > term. If this is accurate then let's please separate the two aspects
> > > to this.
> > 
> > The direct map fragmentation is the reason for higher TLB miss rate,
> > both for iTLB and dTLB.
> 
> OK so then whatever benchmark is running in tandem as eBPF JIT is
> hammered should *also* be measured with perf for iTLB and dTLB. ie, the
> patch can provide such results as a justifications.

Song had done some tests on the old prog pack version that to me seemed
to indicate most (or possibly all) of the benefit was direct map
fragmentation reduction. This surprised me, since 2MB kernel text has
been shown to be beneficial.

Otherwise +1 to all these comments. This should be clear about what the
benefits are. I would add that this is also much nicer about TLB
shootdowns than the existing way of loading text, and it saves some
memory.

So I think there are sort of four areas of improvement:
1. Direct map fragmentation reduction (dTLB miss improvements). This
series sort of does it as a side effect, and the solution Mike is
talking about is a more general, probably better one.
2. 2MB mapped JITs. This is the iTLB side. I think this is a decent
solution for this, but surprisingly it doesn't seem to be useful for
JITs. (modules testing TBD)
3. Loading text to a reused allocation with per-cpu mappings. This
reduces TLB shootdowns, which are a short-term performance drag at load
and teardown time. My understanding is this is more of a problem on
bigger systems with many CPUs. This series does a decent job at this,
but the solution is not compatible with modules. Maybe that's ok, since
modules don't load as often as JITs.
4. Having BPF progs share pages. This saves memory. This series could
probably easily get a number for how much.
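
A rough way to put a number on item 4 is sketched below. This is an
illustration under stated assumptions, not code from the series: it
assumes a 4 KiB base page, assumes a 64-byte allocation granularity
inside the shared pack, and ignores the unused tail of the 2MB pack
itself. The helper names are hypothetical.

```c
#include <stddef.h>

#define PAGE_SIZE_BYTES  4096UL	/* assumption: 4 KiB base pages */
#define CHUNK_SIZE_BYTES 64UL	/* assumption: pack allocation granularity */

/* Round n up to a multiple of align; align must be a power of two. */
static size_t round_up_pow2(size_t n, size_t align)
{
	return (n + align - 1) & ~(align - 1);
}

/* Bytes consumed when every program gets its own page-granular allocation. */
static size_t bytes_page_per_prog(const size_t *sizes, size_t n)
{
	size_t total = 0;

	for (size_t i = 0; i < n; i++)
		total += round_up_pow2(sizes[i], PAGE_SIZE_BYTES);
	return total;
}

/* Bytes consumed when programs are packed into shared pages in chunks. */
static size_t bytes_packed(const size_t *sizes, size_t n)
{
	size_t total = 0;

	for (size_t i = 0; i < n; i++)
		total += round_up_pow2(sizes[i], CHUNK_SIZE_BYTES);
	return total;
}
```

For four hypothetical programs of 200, 520, 1000 and 4100 bytes, the
page-per-program layout uses 20480 bytes and the packed layout uses 6016
bytes. A real measurement would feed in the JITed sizes from an actual
workload.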