On 8/12/2019 7:06 PM, Catalin Marinas wrote:
> Following the discussions on v2 of this patch(set) [1], this series takes
> slightly different approach:
>
> - it implements its own simple memory pool that does not rely on the slab
>   allocator
> - drops the early log buffer logic entirely since it can now allocate
>   metadata from the memory pool directly before kmemleak is fully
>   initialised
> - CONFIG_DEBUG_KMEMLEAK_EARLY_LOG_SIZE option is renamed to
>   CONFIG_DEBUG_KMEMLEAK_MEM_POOL_SIZE
> - moves the kmemleak_init() call earlier (mm_init())
> - to avoid a separate memory pool for struct scan_area, it makes the tool
>   robust when such allocations fail as scan areas are rather an
>   optimisation
>
> [1] http://lkml.kernel.org/r/20190727132334.9184-1-catalin.marinas@xxxxxxx
>
> Catalin Marinas (3):
>   mm: kmemleak: Make the tool tolerant to struct scan_area allocation
>     failures
>   mm: kmemleak: Simple memory allocation pool for kmemleak objects
>   mm: kmemleak: Use the memory pool for early allocations
>
>  init/main.c       |   2 +-
>  lib/Kconfig.debug |  11 +-
>  mm/kmemleak.c     | 325 ++++++++++++----------------------------------
>  3 files changed, 91 insertions(+), 247 deletions(-)

Hi Catalin,
We observe a severe degradation in our network performance, affecting all of
our NICs. The degradation is directly linked to this patch set.
What we run:
A simple iperf TCP loopback test with 8 streams on a ConnectX-5 100GbE NIC.
Since it's a loopback test, traffic goes from the socket through the IP
stack and back to the socket, without going through the NIC driver.
What we observe:
Throughput performance:
- Kernel 5.3GA - Throughput was 230Gbps
- Kernel 5.4-rc1 and later - Throughput is 50Gbps
CPU utilization-wise:
Using perf, we see much higher CPU utilization in kmem-related functions:
Function | Kernel 5.3GA | Kernel 5.4-rc1 and later
--------------------------|--------------|-------------------------
__kfree_skb | 3.4% | 11.0%
kmem_cache_free | 0.3% | 10.2%
__alloc_skb | 2.2% | 26.0%
queued_spin_lock_slowpath | 1.3% | 26.3%
delete_object_full | Not used | 18.0%
The delete_object_full() function seems to be the one that starts the slower
flow. One of the conditions causing this function to kick in is the
kmemleak_free_enabled flag, which your series changed to be enabled by default.
Reverting the discussed series restores the performance almost completely.
Can you help shed light on the subject?
Attachment: perf_record_loopback_8_streams_k5.3_FG.svg
Attachment: perf_record_loopback_8_streams_k5.4_FG.svg