On Wed, Nov 16, 2022 at 05:06:19PM -0800, Song Liu wrote:
> Use execmem_alloc, execmem_free, and execmem_fill instead of
> bpf_prog_pack_alloc, bpf_prog_pack_free, and bpf_arch_text_copy.
>
> execmem_free doesn't require extra size information. Therefore, the free
> and error handling path can be simplified.
>
> There are some tests that show the benefit of execmem_alloc.
>
> Run 100 instances of the following benchmark from bpf selftests:
>   tools/testing/selftests/bpf/bench -w2 -d100 -a trig-kprobe
> which loads 7 BPF programs, and triggers one of them.
>
> Then use perf to monitor TLB related counters:
>   perf stat -e iTLB-load-misses,itlb_misses.walk_completed_4k, \
>     itlb_misses.walk_completed_2m_4m -a
>
> The following results are from a qemu VM with 32 cores.
>
> Before bpf_prog_pack:
>   iTLB-load-misses: 350k/s
>   itlb_misses.walk_completed_4k: 90k/s
>   itlb_misses.walk_completed_2m_4m: 0.1/s
>
> With bpf_prog_pack (current upstream):
>   iTLB-load-misses: 220k/s
>   itlb_misses.walk_completed_4k: 68k/s
>   itlb_misses.walk_completed_2m_4m: 0.2/s
>
> With execmem_alloc (with this set):
>   iTLB-load-misses: 185k/s
>   itlb_misses.walk_completed_4k: 58k/s
>   itlb_misses.walk_completed_2m_4m: 1/s

Wonderful. It would be nice to have this integrated into the bpf
selftests, instead of having to ask someone to repeat and decipher the
steps above. Completion time results would be useful as well.

Then, after that, try running this alongside another memory intensive
benchmark, as recently suggested: let it run for a while and re-run the
measurements, as the direct map fragmentation should reveal that
anything running at the end, after execmem_alloc(), produces gravy
results.

  Luis
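The benchmark procedure quoted above could be scripted into something
like the sketch below. This is not an actual selftest, just a rough
harness: the bench invocation and perf events are taken verbatim from
the message, while the environment knobs (NR_INSTANCES, SAMPLE_SECS)
and the skip logic for machines without the tools are my assumptions.

```shell
#!/bin/bash
# Hypothetical wrapper around the benchmark from the commit message.
# Bench flags and perf events come from the message; everything else
# is an assumption for illustration.

BENCH=${BENCH:-tools/testing/selftests/bpf/bench}
NR_INSTANCES=${NR_INSTANCES:-100}
SAMPLE_SECS=${SAMPLE_SECS:-60}
EVENTS=iTLB-load-misses,itlb_misses.walk_completed_4k,itlb_misses.walk_completed_2m_4m

# Skip cleanly on systems without the selftest binary or perf.
if [ ! -x "$BENCH" ] || ! command -v perf >/dev/null 2>&1; then
	echo "SKIP: bench or perf not available"
	exit 0
fi

# Load and trigger the kprobe benchmark from NR_INSTANCES processes.
for _ in $(seq "$NR_INSTANCES"); do
	"$BENCH" -w2 -d100 -a trig-kprobe &
done

# Sample the iTLB miss counters system-wide while the benchmarks run.
perf stat -e "$EVENTS" -a -- sleep "$SAMPLE_SECS"

wait
```

Running this before and after applying the set would make the
comparison above reproducible without having to decipher the steps by
hand.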