2014-12-17 21:08 GMT+09:00 Jesper Dangaard Brouer <brouer@xxxxxxxxxx>:
> On Wed, 17 Dec 2014 16:13:49 +0900 Joonsoo Kim <js1304@xxxxxxxxx> wrote:
>
>> Ping... and I found another way to remove preempt_disable/enable
>> without complex changes.
>>
>> What we want to ensure is getting tid and kmem_cache_cpu
>> on the same cpu. We can achieve that goal with below condition loop.
>>
>> I ran Jesper's benchmark and saw 3~5% win in a fast-path loop over
>> kmem_cache_alloc+free in CONFIG_PREEMPT.
>>
>> 14.5 ns -> 13.8 ns
>
> Hi Kim,
>
> I've tested your patch. Full report below patch.
>
> Summary, I'm seeing 18.599 ns -> 17.523 ns (-1.076ns better).

Thanks for testing! :)
It will help to convince others.

Thanks.

> For network overload tests:
>
> Dropping packets in iptables raw, which is hitting the slub fast-path.
> Here I'm seeing an improvement of 3ns.
>
> For IP-forward, which is also invoking the slub slower path, I'm seeing
> an improvement of 6ns (I was not expecting to see any improvement
> here, the kmem_cache_alloc code is 24bytes smaller, so perhaps it's
> saving some icache).
>
> Full report below patch...
>
>> See following patch.
>>
>> Thanks.
>>
>> ----------->8-------------
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 95d2142..e537af5 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -2399,8 +2399,10 @@ redo:
>>          * on a different processor between the determination of the pointer
>>          * and the retrieval of the tid.
>>          */
>> -       preempt_disable();
>> -       c = this_cpu_ptr(s->cpu_slab);
>> +       do {
>> +               tid = this_cpu_read(s->cpu_slab->tid);
>> +               c = this_cpu_ptr(s->cpu_slab);
>> +       } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
>>
>>         /*
>>          * The transaction ids are globally unique per cpu and per operation on
>> @@ -2408,8 +2410,6 @@ redo:
>>          * occurs on the right processor and that there was no operation on the
>>          * linked list in between.
>>          */
>> -       tid = c->tid;
>> -       preempt_enable();
>>
>>         object = c->freelist;
>>         page = c->page;
>> @@ -2655,11 +2655,10 @@ redo:
>>          * data is retrieved via this pointer. If we are on the same cpu
>>          * during the cmpxchg then the free will succedd.
>>          */
>> -       preempt_disable();
>> -       c = this_cpu_ptr(s->cpu_slab);
>> -
>> -       tid = c->tid;
>> -       preempt_enable();
>> +       do {
>> +               tid = this_cpu_read(s->cpu_slab->tid);
>> +               c = this_cpu_ptr(s->cpu_slab);
>> +       } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
>>
>>         if (likely(page == c->page)) {
>>                 set_freepointer(s, object, c->freelist);
>
> SLUB evaluation 03
> ==================
>
> Testing patch from Joonsoo Kim <iamjoonsoo.kim@xxxxxxx> slub fast-path
> preempt_{disable,enable} avoidance.
>
> Kernel
> ======
> Compiler: GCC 4.9.1
>
> Kernel config ::
>
>  $ grep PREEMPT .config
>  CONFIG_PREEMPT_RCU=y
>  CONFIG_PREEMPT_NOTIFIERS=y
>  # CONFIG_PREEMPT_NONE is not set
>  # CONFIG_PREEMPT_VOLUNTARY is not set
>  CONFIG_PREEMPT=y
>  CONFIG_PREEMPT_COUNT=y
>  # CONFIG_DEBUG_PREEMPT is not set
>
>  $ egrep -e "SLUB|SLAB" .config
>  # CONFIG_SLUB_DEBUG is not set
>  # CONFIG_SLAB is not set
>  CONFIG_SLUB=y
>  # CONFIG_SLUB_CPU_PARTIAL is not set
>  # CONFIG_SLUB_STATS is not set
>
> On top of::
>
>  commit f96fe225677b3efb74346ebd56fafe3997b02afa
>  Merge: 5543798 eea3e8f
>  Author: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
>  Date: Fri Dec 12 16:11:12 2014 -0800
>
>  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
>
>
> Setup
> =====
>
> netfilter_unload_modules.sh
> netfilter_unload_modules.sh
> sudo rmmod nf_reject_ipv4 nf_reject_ipv6
>
> base_device_setup.sh eth4 # 10G sink/receiving interface (ixgbe)
> base_device_setup.sh eth5
> sudo ethtool --coalesce eth4 rx-usecs 30
> sudo ip neigh add 192.168.21.66 dev eth5 lladdr 00:00:ba:d0:ba:d0
> sudo ip route add 198.18.0.0/15 via 192.168.21.66 dev eth5
>
>
> # sudo tuned-adm active
> Current active profile: latency-performance
>
> Drop in raw
> -----------
> alias iptables='sudo iptables'
> iptables -t raw -N simple || iptables -t raw -F simple
> iptables -t raw -I simple -d 198.18.0.0/15 -j DROP
> iptables -t raw -D PREROUTING -j simple
> iptables -t raw -I PREROUTING -j simple
>
> Generator
> ---------
> ./pktgen02_burst.sh -d 198.18.0.2 -i eth8 -m 90:E2:BA:0A:56:B4 -b 8 -t 3 -s 64
>
>
> Patch by Joonsoo Kim to avoid preempt in slub
> =============================================
>
> baseline: without patch
> -----------------------
>
> baseline kernel v3.18-7016-gf96fe22 at commit f96fe22567
>
> Type:kmem fastpath reuse Per elem: 46 cycles(tsc) 18.599 ns
> - (measurement period time:1.859917529 sec time_interval:1859917529)
> - (invoke count:100000000 tsc_interval:4649791431)
>
> alloc N-pattern before free with 256 elements
>
> Type:kmem alloc+free N-pattern Per elem: 100 cycles(tsc) 40.077 ns
> - (measurement period time:1.025993290 sec time_interval:1025993290)
> - (invoke count:25600000 tsc_interval:2564981743)
>
> single flow/CPU
> * IP-forward
> - instant rx:0 tx:1165376 pps n:60 average: rx:0 tx:1165928 pps
> (instant variation TX -0.407 ns (min:-0.828 max:0.507) RX 0.000 ns)
> * Drop in RAW (slab fast-path test)
> - instant rx:3245248 tx:0 pps n:60 average: rx:3245325 tx:0 pps
> (instant variation TX 0.000 ns (min:0.000 max:0.000) RX -0.007 ns)
>
> Christoph's slab_test, baseline kernel (at commit f96fe22567)::
>
> Single thread testing
> =====================
> 1. Kmalloc: Repeatedly allocate then free test
> 10000 times kmalloc(8) -> 49 cycles kfree -> 62 cycles
> 10000 times kmalloc(16) -> 48 cycles kfree -> 64 cycles
> 10000 times kmalloc(32) -> 53 cycles kfree -> 70 cycles
> 10000 times kmalloc(64) -> 64 cycles kfree -> 77 cycles
> 10000 times kmalloc(128) -> 74 cycles kfree -> 84 cycles
> 10000 times kmalloc(256) -> 84 cycles kfree -> 114 cycles
> 10000 times kmalloc(512) -> 83 cycles kfree -> 116 cycles
> 10000 times kmalloc(1024) -> 81 cycles kfree -> 120 cycles
> 10000 times kmalloc(2048) -> 104 cycles kfree -> 136 cycles
> 10000 times kmalloc(4096) -> 142 cycles kfree -> 165 cycles
> 10000 times kmalloc(8192) -> 238 cycles kfree -> 226 cycles
> 10000 times kmalloc(16384) -> 403 cycles kfree -> 264 cycles
> 2. Kmalloc: alloc/free test
> 10000 times kmalloc(8)/kfree -> 68 cycles
> 10000 times kmalloc(16)/kfree -> 68 cycles
> 10000 times kmalloc(32)/kfree -> 69 cycles
> 10000 times kmalloc(64)/kfree -> 68 cycles
> 10000 times kmalloc(128)/kfree -> 68 cycles
> 10000 times kmalloc(256)/kfree -> 68 cycles
> 10000 times kmalloc(512)/kfree -> 74 cycles
> 10000 times kmalloc(1024)/kfree -> 75 cycles
> 10000 times kmalloc(2048)/kfree -> 74 cycles
> 10000 times kmalloc(4096)/kfree -> 74 cycles
> 10000 times kmalloc(8192)/kfree -> 75 cycles
> 10000 times kmalloc(16384)/kfree -> 510 cycles
>
> $ nm --print-size vmlinux | egrep -e 'kmem_cache_alloc|kmem_cache_free|is_pointer_to_page'
> ffffffff81163bd0 00000000000000e1 T kmem_cache_alloc
> ffffffff81163ac0 000000000000010c T kmem_cache_alloc_node
> ffffffff81162cb0 000000000000013b T kmem_cache_free
>
>
> with patch
> ----------
>
> single flow/CPU
> * IP-forward
> - instant rx:0 tx:1174652 pps n:60 average: rx:0 tx:1174222 pps
> (instant variation TX 0.311 ns (min:-0.230 max:1.018) RX 0.000 ns)
> * compare against baseline:
> - 1174222-1165928 = +8294pps
> - (1/1174222*10^9)-(1/1165928*10^9) = -6.058ns
>
> * Drop in RAW (slab fast-path test)
> - instant rx:3277440 tx:0 pps n:74 average: rx:3277737 tx:0 pps
> (instant variation TX 0.000 ns (min:0.000 max:0.000) RX -0.028 ns)
> * compare against baseline:
> - 3277737-3245325 = +32412 pps
> - (1/3277737*10^9)-(1/3245325*10^9) = -3.047ns
>
> SLUB fast-path test: time_bench_kmem_cache1
> * modprobe time_bench_kmem_cache1 ; rmmod time_bench_kmem_cache1; sudo dmesg -c
>
> Type:kmem fastpath reuse Per elem: 43 cycles(tsc) 17.523 ns (step:0)
> - (measurement period time:1.752338378 sec time_interval:1752338378)
> - (invoke count:100000000 tsc_interval:4380843588)
> * difference: 17.523 - 18.599 = -1.076ns
>
> alloc N-pattern before free with 256 elements
>
> Type:kmem alloc+free N-pattern Per elem: 100 cycles(tsc) 40.369 ns (step:0)
> - (measurement period time:1.033447112 sec time_interval:1033447112)
> - (invoke count:25600000 tsc_interval:2583616203)
> * difference: 40.369 - 40.077 = +0.292ns
>
>
> Christoph's slab_test::
>
> Single thread testing
> =====================
> 1. Kmalloc: Repeatedly allocate then free test
> 10000 times kmalloc(8) -> 46 cycles kfree -> 61 cycles
> 10000 times kmalloc(16) -> 46 cycles kfree -> 63 cycles
> 10000 times kmalloc(32) -> 49 cycles kfree -> 69 cycles
> 10000 times kmalloc(64) -> 57 cycles kfree -> 76 cycles
> 10000 times kmalloc(128) -> 66 cycles kfree -> 83 cycles
> 10000 times kmalloc(256) -> 84 cycles kfree -> 110 cycles
> 10000 times kmalloc(512) -> 77 cycles kfree -> 114 cycles
> 10000 times kmalloc(1024) -> 80 cycles kfree -> 116 cycles
> 10000 times kmalloc(2048) -> 102 cycles kfree -> 131 cycles
> 10000 times kmalloc(4096) -> 135 cycles kfree -> 163 cycles
> 10000 times kmalloc(8192) -> 238 cycles kfree -> 218 cycles
> 10000 times kmalloc(16384) -> 399 cycles kfree -> 262 cycles
> 2. Kmalloc: alloc/free test
> 10000 times kmalloc(8)/kfree -> 65 cycles
> 10000 times kmalloc(16)/kfree -> 66 cycles
> 10000 times kmalloc(32)/kfree -> 65 cycles
> 10000 times kmalloc(64)/kfree -> 66 cycles
> 10000 times kmalloc(128)/kfree -> 66 cycles
> 10000 times kmalloc(256)/kfree -> 71 cycles
> 10000 times kmalloc(512)/kfree -> 72 cycles
> 10000 times kmalloc(1024)/kfree -> 71 cycles
> 10000 times kmalloc(2048)/kfree -> 71 cycles
> 10000 times kmalloc(4096)/kfree -> 71 cycles
> 10000 times kmalloc(8192)/kfree -> 65 cycles
> 10000 times kmalloc(16384)/kfree -> 511 cycles
>
> $ nm --print-size vmlinux | egrep -e 'kmem_cache_alloc|kmem_cache_free|is_pointer_to_page'
> ffffffff81163ba0 00000000000000c9 T kmem_cache_alloc
> ffffffff81163aa0 00000000000000f8 T kmem_cache_alloc_node
> ffffffff81162cb0 0000000000000133 T kmem_cache_free
>
>
>
> Kernel size change
> ------------------
>
> $ scripts/bloat-o-meter vmlinux vmlinux-kim-preempt-avoid
> add/remove: 0/0 grow/shrink: 0/8 up/down: 0/-248 (-248)
> function                                     old     new   delta
> kmem_cache_free                              315     307      -8
> kmem_cache_alloc_node                        268     248     -20
> kmem_cache_alloc                             225     201     -24
> kfree                                        274     250     -24
> __kmalloc_node_track_caller                  356     324     -32
> __kmalloc_node                               340     308     -32
> __kmalloc                                    324     273     -51
> __kmalloc_track_caller                       343     286     -57
>
>
> Qmempool notes:
> ---------------
>
> On baseline kernel:
>
> Type:qmempool fastpath reuse SOFTIRQ Per elem: 33 cycles(tsc) 13.287 ns
> - (measurement period time:0.398628965 sec time_interval:398628965)
> - (invoke count:30000000 tsc_interval:996571541)
>
> Type:qmempool fastpath reuse BH-disable Per elem: 47 cycles(tsc) 19.180 ns
> - (measurement period time:0.575425927 sec time_interval:575425927)
> - (invoke count:30000000 tsc_interval:1438563781)
>
> qmempool_bench: N-pattern with 256 elements
>
> Type:qmempool alloc+free N-pattern Per elem: 62 cycles(tsc) 24.955 ns (step:0)
> - (measurement period time:0.638871008 sec time_interval:638871008)
> - (invoke count:25600000 tsc_interval:1597176303)
>
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Sr. Network Kernel Developer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx. For more info on Linux MM,
see: http://www.linux-mm.org/ .