On Mon, Nov 8, 2021 at 6:05 AM Hou Tao <houtao1@xxxxxxxxxx> wrote:
>
> Hi,
>
> On 11/7/2021 2:43 AM, Alexei Starovoitov wrote:
> > On Sat, Nov 06, 2021 at 09:28:22PM +0800, Hou Tao wrote:
> >> The benchmark runs a loop 5000 times. In the loop it reads the file name
> >> from the kprobe argument into the stack by using bpf_probe_read_kernel_str(),
> >> and compares the file name with a target character or string.
> >>
> >> Three cases are compared: only compare one character, compare the whole
> >> string by a home-made strncmp() and compare the whole string by
> >> bpf_strncmp().
> >>
> >> The following is the result:
> >>
> >> x86-64 host:
> >>
> >> one character: 2613499 ns
> >> whole str by strncmp: 2920348 ns
> >> whole str by helper: 2779332 ns
> >>
> >> arm64 host:
> >>
> >> one character: 3898867 ns
> >> whole str by strncmp: 4396787 ns
> >> whole str by helper: 3968113 ns
> >>
> >> Compared with home-made strncmp, the performance of bpf_strncmp helper
> >> improves 80% under x86-64 and 600% under arm64. The big performance win
> >> on arm64 may come from its arch-optimized strncmp().
> > 80% and 600% improvement?!
> > I don't understand how this math works.
> > Why one char is barely different in total nsec than the whole string?
> > The string shouldn't miscompare on the first char as far as I understand the test.
> Because the result of "one character" includes the overhead of process filtering and
> string read.
> My bad, I should explain the test results in more detail.

Maybe use the bench framework for your benchmark? It lets you set up
the benchmark and collect measurements in a more structured way. Check
some existing benchmarks under benchs/ in the selftests/bpf directory.

To test just bpf_strncmp(), don't put bpf_probe_read_kernel_str() into
the loop logic; set your data in a global variable and just search it.
This will give you more accurate microbenchmark data.
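As a rough illustration of the "data in a global variable, compare in a
loop" shape, here is a plain userspace C sketch. It is not a BPF program
and does not use the bench framework; the names target, candidate,
local_strncmp() and run_cmp_loop() are made up for this example, and the
64-byte size simply mirrors the buffer size mentioned in the quoted test.

```c
#include <stddef.h>

#define STR_SZ 64

/* Global data (hypothetical names): nothing is read inside the loop. */
const char target[STR_SZ] = "benchmark_file_name";
char candidate[STR_SZ] = "benchmark_file_name";

/* Home-made bounded compare: returns <0, 0 or >0, like strncmp(). */
int local_strncmp(const char *s1, const char *s2, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned char c1 = (unsigned char)s1[i];
		unsigned char c2 = (unsigned char)s2[i];

		if (c1 != c2)
			return c1 < c2 ? -1 : 1;
		if (c1 == '\0')
			break;
	}
	return 0;
}

/* Run only the comparison `loops` times; returns the number of matches. */
unsigned int run_cmp_loop(unsigned int loops)
{
	unsigned int i, hits = 0;

	for (i = 0; i < loops; i++) {
		if (local_strncmp(candidate, target, STR_SZ) == 0)
			hits++;
	}
	return hits;
}
```

Calling run_cmp_loop(5000) from a timed section would then measure the
comparison alone, without the pid filtering and string read that the
"one character" case was used to subtract out.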
>
> Three tests are exercised:
>
> (1) one character
>     Filter unexpected callers by bpf_get_current_pid_tgid()
>     Use bpf_probe_read_kernel_str() to read the file name into a 64-byte
>     buffer on the stack
>     Only compare the first character of the file name
>
> (2) whole str by strncmp
>     Filter unexpected callers by bpf_get_current_pid_tgid()
>     Use bpf_probe_read_kernel_str() to read the file name into a 64-byte
>     buffer on the stack
>     Compare by using the home-made strncmp(): the two compared strings are
>     the same, so the whole string is compared
>
> (3) whole str by helper
>     Filter unexpected callers by bpf_get_current_pid_tgid()
>     Use bpf_probe_read_kernel_str() to read the file name into a 64-byte
>     buffer on the stack
>     Compare by using bpf_strncmp(): the two compared strings are the same,
>     so the whole string is compared
>
> Now "(1) one character" is used to calculate the overhead of process filtering and
> string read. So under x86-64, the overhead of strncmp() is:
>
> total time of whole str by strncmp test - total time of one character test =
> 306849 ns
>
> The overhead of bpf_strncmp() is:
>
> total time of whole str by helper test - total time of one character test =
> 165833 ns
>
> So the performance win is about (306849 / 165833) * 100 - 100 = ~85%
>
> And the win under arm64 is about (497920 / 69246) * 100 - 100 = ~600%
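
For reference, the subtraction and ratio quoted above can be reproduced
with a small C sketch like the one below. The function names are
hypothetical and only restate the arithmetic from the message; the inputs
are the totals reported in the thread.

```c
#include <stdint.h>

/*
 * Time attributable to the comparison alone: the test total minus the
 * shared baseline (the "one character" test, which still pays for the
 * pid filtering and the string read).
 */
uint64_t cmp_overhead_ns(uint64_t total_ns, uint64_t baseline_ns)
{
	return total_ns - baseline_ns;
}

/* Relative win of the helper over the home-made strncmp(), in percent. */
double win_percent(uint64_t homemade_ns, uint64_t helper_ns)
{
	return (double)homemade_ns / (double)helper_ns * 100.0 - 100.0;
}
```

With the x86-64 totals, cmp_overhead_ns(2920348, 2613499) is 306849 and
cmp_overhead_ns(2779332, 2613499) is 165833, so win_percent(306849, 165833)
comes out at roughly 85%. With the arm64 totals the overheads are 497920
and 69246, and the ratio works out to roughly 619%, which the message
rounds to ~600%.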