Re: [RFC PATCH bpf-next 2/2] selftests/bpf: add benchmark bpf_strcmp

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Mon, 8 Nov 2021 10:00:30 -0800

On Mon, Nov 8, 2021 at 6:05 AM Hou Tao <houtao1@xxxxxxxxxx> wrote:
>
> Hi,
>
> On 11/7/2021 2:43 AM, Alexei Starovoitov wrote:
> > On Sat, Nov 06, 2021 at 09:28:22PM +0800, Hou Tao wrote:
> >> The benchmark runs a loop 5000 times. In the loop it reads the file name
> >> from kprobe argument into stack by using bpf_probe_read_kernel_str(),
> >> and compares the file name with a target character or string.
> >>
> >> Three cases are compared: only compare one character, compare the whole
> >> string by a home-made strncmp() and compare the whole string by
> >> bpf_strcmp().
> >>
> >> The following is the result:
> >>
> >> x86-64 host:
> >>
> >> one character: 2613499 ns
> >> whole str by strncmp: 2920348 ns
> >> whole str by helper: 2779332 ns
> >>
> >> arm64 host:
> >>
> >> one character: 3898867 ns
> >> whole str by strncmp: 4396787 ns
> >> whole str by helper: 3968113 ns
> >>
> >> Compared with home-made strncmp, the performance of bpf_strncmp helper
> >> improves 80% under x86-64 and 600% under arm64. The big performance win
> >> on arm64 may come from its arch-optimized strncmp().
> > 80% and 600% improvement?!
> > I don't understand how this math works.
> > Why one char is barely different in total nsec than the whole string?
> > The string shouldn't miscompare on the first char as far as I understand the test.
> Because the result of "one character" includes the overhead of process filtering and
> string read.
> My bad, I should have explained the test results in more detail.

Maybe use the bench framework for your benchmark? It lets you set up the
benchmark and collect measurements in a more structured way. Check
some of the existing benchmarks under benchs/ in the selftests/bpf directory.

To actually test just bpf_strncmp(), don't put
bpf_probe_read_kernel_str() into the loop logic; set your data in a
global variable and just search it. This will give you more accurate
microbenchmark data.

>
> Three tests are exercised:
>
> (1) one character
> Filter out unexpected callers with bpf_get_current_pid_tgid()
> Use bpf_probe_read_kernel_str() to read the file name into a 64-byte buffer
> on the stack
> Only compare the first character of the file name
>
> (2) whole str by strncmp
> Filter out unexpected callers with bpf_get_current_pid_tgid()
> Use bpf_probe_read_kernel_str() to read the file name into a 64-byte buffer
> on the stack
> Compare by using the home-made strncmp(): the two compared strings are the same, so
> the whole string is compared
>
> (3) whole str by helper
> Filter out unexpected callers with bpf_get_current_pid_tgid()
> Use bpf_probe_read_kernel_str() to read the file name into a 64-byte buffer
> on the stack
> Compare by using bpf_strncmp(): the two compared strings are the same, so
> the whole string is compared
>
> Now "(1) one character" is used to calculate the overhead of process filtering and
> string read. So under x86-64, the overhead of the home-made strncmp() is:
>
>   total time of whole str by strncmp test - total time of one character test =
> 2920348 - 2613499 = 306849 ns
>
> The overhead of bpf_strncmp() is:
>   total time of whole str by helper test - total time of one character test =
> 2779332 - 2613499 = 165833 ns
>
> So the performance win is about (306849 / 165833) * 100 - 100 = ~85%
>
> And the win under arm64 is about (497920 / 69246) * 100 - 100 = ~600%