Re: [PATCH bpf-next v1 RESEND 1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec

Song Liu <song@xxxxxxxxxx> · Thu, 3 Nov 2022 14:41:42 -0700

On Thu, Nov 3, 2022 at 2:19 PM Edgecombe, Rick P
<rick.p.edgecombe@xxxxxxxxx> wrote:
>
> On Thu, 2022-11-03 at 11:59 -0700, Luis Chamberlain wrote:
> > > > Mike Rapoport had presented about the Direct map fragmentation
> > > > problem
> > > > at Plumbers 2021 [0], and clearly mentioned modules / BPF /
> > > > ftrace /
> > > > kprobes as possible sources for this. Then Xing Zhengjun's 2021
> > > > performance
> > > > evaluation on whether using 2M/1G pages aggressively for the
> > > > kernel direct map
> > > > help performance [1] ends up generally recommending huge pages.
> > > > The work by Xing
> > > > though was about using huge pages *alone*, not using a strategy
> > > > such as in the
> > > > "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF
> > > > programs,
> > > > and that I think is the real golden nugget here.
> > > >
> > > > I contend therefore that the theoretical reduction of iTLB misses
> > > > by using
> > > > huge pages for "bpf prog pack" is not what gets your systems to
> > > > perform
> > > > somehow better. It should be simply that it reduces fragmentation
> > > > and
> > > > *this* generally can help with performance long term. If this is
> > > > accurate
> > > > then let's please separate the two aspects to this.
> > >
> > > The direct map fragmentation is the reason for higher TLB miss
> > > rate, both
> > > for iTLB and dTLB.
> >
> > OK so then whatever benchmark is running in tandem as eBPF JIT is
> > hammered
> > should *also* be measured with perf for iTLB and dTLB. ie, the patch
> > can
> > provide such results as a justifications.
>
> Song had done some tests on the old prog pack version that to me seemed
> to indicate most (or possibly all) of the benefit was direct map
> fragmentation reduction. This was surprised me, since 2MB kernel text
> has shown to be beneficial.
>
> Otherwise +1 to all these comments. This should be clear about what the
> benefits are. I would add, that this is also much nicer about TLB
> shootdowns than the existing way of loading text and saves some memory.
>
> So I think there are sort of four areas of improvements:
> 1. Direct map fragmentation reduction (dTLB miss improvements). This
> sort of does it as a side effect in this series, and the solution Mike
> is talking about is a more general, probably better one.
> 2. 2MB mapped JITs. This is the iTLB side. I think this is a decent
> solution for this, but surprisingly it doesn't seem to be useful for
> JITs. (modules testing TBD)
> 3. Loading text to reused allocation with per-cpu mappings. This
> reduces TLB shootdowns, which are a short term load and teardown time
> performance drag. My understanding is this is more of a problem on
> bigger systems with many CPUs. This series does a decent job at this,
> but the solution is not compatible with modules. Maybe ok since modules
> don't load as often as JITs.
> 4. Having BPF progs share pages. This saves memory. This series could
> probably easily get a number for how much.
>

Hi Luis, Rick, and Mike,

Thanks a lot for helping me organize this information. Totally agree with
all these comments. I will add more data to the next revision.

Besides the motivation improvement, could you please also share your
comments on:
1. The logic/design of the vmalloc_exec() et. al. APIs;
2. The naming of these functions. Does  execmem_[alloc|free|fill|cpy]
  (as suggested by Chritoph) sound good?

Thanks,
Song