Re: [PATCH bpf-next v3 4/6] bpf: use execmem_alloc for bpf program and bpf dispatcher

On Thu, Nov 17, 2022 at 12:01:30PM -0800, Luis Chamberlain wrote:
> On Wed, Nov 16, 2022 at 06:10:23PM -0800, Alexei Starovoitov wrote:
> > On Wed, Nov 16, 2022 at 6:04 PM Luis Chamberlain <mcgrof@xxxxxxxxxx> wrote:
> > >
> > > On Wed, Nov 16, 2022 at 05:06:19PM -0800, Song Liu wrote:
> > > > Use execmem_alloc, execmem_free, and execmem_fill instead of
> > > > bpf_prog_pack_alloc, bpf_prog_pack_free, and bpf_arch_text_copy.
> > > >
> > > > execmem_free doesn't require extra size information. Therefore, the free
> > > > and error handling path can be simplified.
> > > >
> > > > There are some tests that show the benefit of execmem_alloc.
> > > >
> > > > Run 100 instances of the following benchmark from bpf selftests:
> > > >   tools/testing/selftests/bpf/bench -w2 -d100 -a trig-kprobe
> > > > which loads 7 BPF programs, and triggers one of them.
> > > >
> > > > Then use perf to monitor TLB related counters:
> > > >    perf stat -e iTLB-load-misses,itlb_misses.walk_completed_4k, \
> > > >            itlb_misses.walk_completed_2m_4m -a
> > > >
> > > > The following results are from a qemu VM with 32 cores.
> > > >
> > > > Before bpf_prog_pack:
> > > >   iTLB-load-misses: 350k/s
> > > >   itlb_misses.walk_completed_4k: 90k/s
> > > >   itlb_misses.walk_completed_2m_4m: 0.1/s
> > > >
> > > > With bpf_prog_pack (current upstream):
> > > >   iTLB-load-misses: 220k/s
> > > >   itlb_misses.walk_completed_4k: 68k/s
> > > >   itlb_misses.walk_completed_2m_4m: 0.2/s
> > > >
> > > > With execmem_alloc (with this set):
> > > >   iTLB-load-misses: 185k/s
> > > >   itlb_misses.walk_completed_4k: 58k/s
> > > >   itlb_misses.walk_completed_2m_4m: 1/s
> > >
> > > Wonderful.
> > >
> > > It would be nice to have this integrated into the bpf selftest,
> > 
> > 
> > No. Luis please stop suggesting things that don't make sense.
> > selftest/bpf are not doing performance benchmarking.
> > We have the 'bench' tool for that.
> > That's what Song used and it's only running standalone
> > and not part of any CI.
> 
> I'm not suggesting to instantiate the VM or crap like that, I'm just
> asking for the simple script to run 100 instances. This allows folks
> to reproduce results in an easy way.
> 
> Whether or not you want that in selftests/bpf is fine; a simple script
> in the commit log can easily express the loop in bash, if that's all
> that was done.

There's also the issue of assuming that iTLB stats collected inside a VM
are a reliable representation of what we see on bare metal, so it'd be
nice to get bare metal stats too.
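
For reference, a minimal sketch of the kind of loop meant above. This is
not the script Song actually used: the bench and perf command lines are
taken from the commit log, while the variable names (BENCH, N, DURATION,
SAMPLE_SECS), the "run" argument and the 60 second sampling window are
assumptions made only for illustration.

```shell
#!/bin/sh
# Sketch only: start N copies of the bench tool, sample iTLB counters
# system-wide while they run, then clean up.
# BENCH, N, DURATION and SAMPLE_SECS are illustrative knobs (assumptions),
# not something defined by the patch set.
BENCH=${BENCH:-tools/testing/selftests/bpf/bench}
N=${N:-100}
DURATION=${DURATION:-100}
SAMPLE_SECS=${SAMPLE_SECS:-60}
EVENTS=iTLB-load-misses,itlb_misses.walk_completed_4k,itlb_misses.walk_completed_2m_4m

# Start N bench instances in the background and remember their PIDs.
start_instances() {
    pids=""
    i=0
    while [ "$i" -lt "$N" ]; do
        "$BENCH" -w2 -d"$DURATION" -a trig-kprobe >/dev/null 2>&1 &
        pids="$pids $!"
        i=$((i + 1))
    done
}

# Count the iTLB events on all CPUs for SAMPLE_SECS seconds.
sample_itlb() {
    perf stat -e "$EVENTS" -a -- sleep "$SAMPLE_SECS"
}

# Do nothing unless explicitly asked, e.g.: sh itlb-bench.sh run
if [ "${1:-}" = "run" ]; then
    start_instances
    sample_itlb
    # Stop any instances that are still running after the sample window.
    kill $pids 2>/dev/null
    wait
fi
```

The same script can be run unchanged in the VM and on bare metal, which
would make the two sets of numbers directly comparable. The
itlb_misses.* event names are the ones quoted in the commit log and are
CPU specific, so they may need adjusting on other hardware.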

  Luis



