On 6/20/22 7:57 PM, Christoph Lameter wrote:
On Sat, 18 Jun 2022, Rongwei Wang wrote:
Well, the cycle reduction is strange. Were the tests not done in the same
environment? It may be good to not use NUMA, or to bind to the same CPU.
It's the same environment, I am sure of that. There are four nodes (32G of
memory and 8 cores per node) in my test environment. Should I test within a
single node? If so, I can try.
Ok, in a NUMA environment the memory allocation is randomized on bootup.
You may get different numbers after you reboot the system. Try to switch
NUMA off. Use a single node to get consistent numbers.
Sorry for the late reply.
First, let me share my test environment: an arm64 VM (32 cores and 128G
memory), configured with only one node.
In addition, I used 'numactl -N 0 -m 0 qemu-kvm ...' to start this VM,
mainly to avoid data jitter.
And I can confirm that the physical machine hosting this VM had the same
configuration when I tested the original kernel and the patched kernel. If
the above test environment has any problems, please let me know.
The following is the latest data:
Single thread testing
1. Kmalloc: Repeatedly allocate then free test
                       before                    fix
                       kmalloc     kfree         kmalloc     kfree
10000 times 8          4 cycles    5 cycles      4 cycles    5 cycles
10000 times 16         3 cycles    5 cycles      3 cycles    5 cycles
10000 times 32         3 cycles    5 cycles      3 cycles    5 cycles
10000 times 64         3 cycles    5 cycles      3 cycles    5 cycles
10000 times 128        3 cycles    5 cycles      3 cycles    5 cycles
10000 times 256        14 cycles   9 cycles      6 cycles    8 cycles
10000 times 512        9 cycles    8 cycles      9 cycles    10 cycles
10000 times 1024       48 cycles   10 cycles     6 cycles    10 cycles
10000 times 2048       31 cycles   12 cycles     35 cycles   13 cycles
10000 times 4096       96 cycles   17 cycles     96 cycles   18 cycles
10000 times 8192       188 cycles  27 cycles     190 cycles  27 cycles
10000 times 16384      117 cycles  38 cycles     115 cycles  38 cycles
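
For reference, each row above is produced by a timing loop roughly like the
sketch below. This is only my simplified reconstruction, not the actual test
module; the helper name time_kmalloc() and the use of kvmalloc_array() and
get_cycles() here are illustrative assumptions:

#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/timex.h>
#include <linux/printk.h>

/*
 * Simplified sketch of the "repeatedly allocate then free" measurement:
 * allocate 'count' objects of 'size' bytes, then free them all, and
 * report the average cycles per kmalloc() and per kfree().
 */
static void time_kmalloc(size_t size, unsigned long count)
{
	void **objs;
	unsigned long i;
	cycles_t start, alloc_cycles, free_cycles;

	objs = kvmalloc_array(count, sizeof(void *), GFP_KERNEL);
	if (!objs)
		return;

	start = get_cycles();
	for (i = 0; i < count; i++)
		objs[i] = kmalloc(size, GFP_KERNEL);
	alloc_cycles = get_cycles() - start;

	start = get_cycles();
	for (i = 0; i < count; i++)
		kfree(objs[i]);
	free_cycles = get_cycles() - start;

	pr_info("%lu times %zu: %llu cycles kmalloc, %llu cycles kfree\n",
		count, size,
		(unsigned long long)(alloc_cycles / count),
		(unsigned long long)(free_cycles / count));

	kvfree(objs);
}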
2. Kmalloc: alloc/free test
                                     before      fix
10000 times kmalloc(8)/kfree         3 cycles    3 cycles
10000 times kmalloc(16)/kfree        3 cycles    3 cycles
10000 times kmalloc(32)/kfree        3 cycles    3 cycles
10000 times kmalloc(64)/kfree        3 cycles    3 cycles
10000 times kmalloc(128)/kfree       3 cycles    3 cycles
10000 times kmalloc(256)/kfree       3 cycles    3 cycles
10000 times kmalloc(512)/kfree       3 cycles    3 cycles
10000 times kmalloc(1024)/kfree      3 cycles    3 cycles
10000 times kmalloc(2048)/kfree      3 cycles    3 cycles
10000 times kmalloc(4096)/kfree      3 cycles    3 cycles
10000 times kmalloc(8192)/kfree      3 cycles    3 cycles
10000 times kmalloc(16384)/kfree     33 cycles   33 cycles
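
The alloc/free pair test times a kmalloc() followed immediately by its
kfree(); again only a rough sketch under the same assumptions as above:

/*
 * Simplified sketch of the alloc/free pair measurement: each iteration
 * allocates one object and frees it right away, so the fast paths and a
 * cache-hot object dominate the cost.
 */
static void time_kmalloc_pair(size_t size, unsigned long count)
{
	unsigned long i;
	cycles_t start, total;

	start = get_cycles();
	for (i = 0; i < count; i++)
		kfree(kmalloc(size, GFP_KERNEL));
	total = get_cycles() - start;

	pr_info("%lu times kmalloc(%zu)/kfree: %llu cycles\n",
		count, size, (unsigned long long)(total / count));
}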
Concurrent allocs
                                     before              fix
Kmalloc N*alloc N*free(8)            Average=13/14       Average=14/15
Kmalloc N*alloc N*free(16)           Average=13/15       Average=13/15
Kmalloc N*alloc N*free(32)           Average=13/15       Average=13/15
Kmalloc N*alloc N*free(64)           Average=13/15       Average=13/15
Kmalloc N*alloc N*free(128)          Average=13/15       Average=13/15
Kmalloc N*alloc N*free(256)          Average=137/29      Average=134/39
Kmalloc N*alloc N*free(512)          Average=61/29       Average=64/28
Kmalloc N*alloc N*free(1024)         Average=465/50      Average=656/55
Kmalloc N*alloc N*free(2048)         Average=503/97      Average=422/97
Kmalloc N*alloc N*free(4096)         Average=1592/206    Average=1624/207
Kmalloc N*(alloc free)(8)            Average=3           Average=3
Kmalloc N*(alloc free)(16)           Average=3           Average=3
Kmalloc N*(alloc free)(32)           Average=3           Average=3
Kmalloc N*(alloc free)(64)           Average=3           Average=3
Kmalloc N*(alloc free)(128)          Average=3           Average=3
Kmalloc N*(alloc free)(256)          Average=3           Average=3
Kmalloc N*(alloc free)(512)          Average=3           Average=3
Kmalloc N*(alloc free)(1024)         Average=3           Average=3
Kmalloc N*(alloc free)(2048)         Average=3           Average=3
Kmalloc N*(alloc free)(4096)         Average=3           Average=3
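
For the concurrent numbers, one way to drive the same loops from every CPU at
once is to spawn a kthread bound to each online CPU. The sketch below is only
an illustration of that scheme; worker_fn(), run_concurrent() and the
completion-based synchronization are my assumptions, not the actual test
module:

#include <linux/kthread.h>
#include <linux/completion.h>
#include <linux/cpumask.h>
#include <linux/atomic.h>
#include <linux/err.h>

static DECLARE_COMPLETION(workers_done);
static atomic_t workers_left;

/* Each worker runs the single-threaded loop bodies on its own CPU. */
static int worker_fn(void *data)
{
	size_t size = (size_t)(unsigned long)data;

	time_kmalloc(size, 10000);	/* N*alloc then N*free */
	time_kmalloc_pair(size, 10000);	/* N*(alloc free)      */

	if (atomic_dec_and_test(&workers_left))
		complete(&workers_done);
	return 0;
}

static void run_concurrent(size_t size)
{
	struct task_struct *tsk;
	int cpu;

	reinit_completion(&workers_done);
	atomic_set(&workers_left, num_online_cpus());

	for_each_online_cpu(cpu) {
		tsk = kthread_create(worker_fn,
				     (void *)(unsigned long)size,
				     "kmalloc_test/%d", cpu);
		if (IS_ERR(tsk)) {
			if (atomic_dec_and_test(&workers_left))
				complete(&workers_done);
			continue;
		}
		kthread_bind(tsk, cpu);
		wake_up_process(tsk);
	}

	wait_for_completion(&workers_done);
}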
Does the above data indicate that this modification (which only takes effect
when kmem_cache_debug(s) is true) does not introduce a significant
performance impact?
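
For reference, kmem_cache_debug(s) in mm/slub.c is essentially a check for
the cache's debug flags, roughly like the simplified version below, so caches
created without any debug flags (the normal production case) keep using the
unmodified paths:

/* Roughly what mm/slub.c checks (simplified from the actual source). */
static inline bool kmem_cache_debug(struct kmem_cache *s)
{
	return IS_ENABLED(CONFIG_SLUB_DEBUG) &&
	       (s->flags & SLAB_DEBUG_FLAGS);
}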
Thanks for your time.
It may be useful to figure out which memory structure causes the increase in
latency in a NUMA environment. If you can figure that out and properly
allocate the structure that causes the increase in latency, then you may be
able to improve the performance of the allocator.
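
If a specific per-node structure turns out to be the problem, placing it on
its home node is usually just a matter of using the node-aware allocation
APIs; for example (illustrative only, struct foo and alloc_foo_on_node() are
made-up placeholders):

#include <linux/slab.h>

struct foo {
	int data;
};

/*
 * Illustration only: allocate the hot structure on the node that will
 * access it, instead of wherever the allocating CPU happens to run.
 */
static struct foo *alloc_foo_on_node(int node)
{
	return kmalloc_node(sizeof(struct foo), GFP_KERNEL, node);
}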