Re: [PATCH bpf-next] bpf/lpm_trie: inline longest_prefix_match for fastpath

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 15/03/2024 18.08, Jesper Dangaard Brouer wrote:


On 15/03/2024 16.03, Daniel Borkmann wrote:
On 3/12/24 4:17 PM, Jesper Dangaard Brouer wrote:
The BPF map type LPM (Longest Prefix Match) is used heavily
in production by multiple products that have BPF components.
Perf data shows trie_lookup_elem() and longest_prefix_match()
being part of kernels perf top.

You mention these are heavy hitters in prod ...

For every level in the LPM tree trie_lookup_elem() calls out
to longest_prefix_match().  The compiler is free to inline this
call, but chooses not to inline, because other slowpath callers
(that can be invoked via syscall) exists like trie_update_elem(),
trie_delete_elem() or trie_get_next_key().

  bcc/tools/funccount -Ti 1 'trie_lookup_elem|longest_prefix_match.isra.0'
  FUNC                                    COUNT
  trie_lookup_elem                       664945
  longest_prefix_match.isra.0           8101507

Observation on a single random metal shows a factor 12 between
the two functions. Given an average of 12 levels in the trie being
searched.

This patch force inlining longest_prefix_match(), but only for
the lookup fastpath to balance object instruction size.

  $ bloat-o-meter kernel/bpf/lpm_trie.o.orig-noinline kernel/bpf/lpm_trie.o
  add/remove: 1/1 grow/shrink: 1/0 up/down: 335/-4 (331)
  Function                                     old     new   delta
  trie_lookup_elem                             179     510    +331
  __BTF_ID__struct__lpm_trie__706741             -       4      +4
  __BTF_ID__struct__lpm_trie__706733             4       -      -4
  Total: Before=3056, After=3387, chg +10.83%

... and here you quote bloat-o-meter instead. But do you also see an
observable perf gain in prod after this change? (No objection from my
side but might be good to mention here.. given if not then why do the
change?)


I'm still waiting for more production servers to reboot into patched
kernels.  I do have some "low-level" numbers from previous generation
AMD servers, running kernel 6.1, which should be less affected by the
SRSO (than our 6.6 kernels). Waiting for newer generation to get kernel
updates, and especially 6.6 will be interesting.

There were no larger performance benefit on 6.6 it is basically the same.

Newer generation (11G) hardware latency overhead of trie_lookup_elem
 - avg 1181 nsecs for patched kernel
 - avg 1269 nsecs for non patched kernel
 - around 7% improvement or 88 ns


 From production measurements the latency overhead of trie_lookup_elem:
  - avg 1220 nsecs for patched kernel
  - avg 1329 nsecs for non patched kernel
  - around 8% improvement or 109 nanosec
  - given approx 12 calls "saved" this is 9 ns per function call
  - for reference on Intel I measured func call to cost 1.3ns
  - this extra overhead is caused by __x86_return_thunk().


I also see slight improvement in the graphs, but given how much
production varies I don't want to draw conclusions yet.


Still inconclusive due to variations in traffic distribution due to
load-balancing isn't perfect.


Details: Due to AMD mitigation for SRSO (Speculative Return Stack Overflow)
these function calls have additional overhead. On newer kernels this shows
up under srso_safe_ret() + srso_return_thunk(), and on older kernels (6.1)
under __x86_return_thunk(). Thus, for production workloads the biggest gain
comes from avoiding this mitigation overhead.

Signed-off-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>

I'm going to send a V2, because kernel doc processing found an issue:
 - https://netdev.bots.linux.dev/static/nipa/834681/13590144/kdoc/stderr

I'll try to incorporate the production improvements we are seeing, but
it feels wrong to write about kernel 6.1 and 6.6 improvements, but I
(currently) cannot deploy a bpf-next kernel in production (but I do test
latest kernels in my local testlab).

--Jesper




[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux