On Thu, 1 Dec 2016 23:47:44 +0100
Hannes Frederic Sowa <hannes@xxxxxxxxxxxxxxxxxxx> wrote:

> Side note:
>
> On 01.12.2016 20:51, Tom Herbert wrote:
> >> > E.g. "mini-skb": Even if we assume that this provides a speedup
> >> > (where does that come from? should make no difference if a 32 or
> >> > 320 byte buffer gets allocated).

Yes, the size of the allocation from the SLUB allocator does not change
the base performance/cost much (at least for small objects, if < 1024).

Do notice that the base SLUB alloc+free cost is fairly high (compared to
a 201 cycles budget), especially for networking, as the free-side is
very likely to hit a slow path.  The SLUB fast-path costs 53 cycles, and
the slow-path around 100 cycles (data from [1]).  I've tried to address
this with the kmem_cache bulk APIs, which reduce the cost to approx 30
cycles.  (Something we have not fully reaped the benefit from yet!)

[1] https://git.kernel.org/torvalds/c/ca257195511

> >> >
> > It's the zero'ing of three cache lines. I believe we talked about that
> > as netdev.

Actually 4 cache-lines, but with some cleanup I believe we can get down
to clearing 192 bytes (3 cache-lines).

> Jesper and me played with that again very recently:
>
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c#L590
>
> In micro-benchmarks we saw a pretty good speed up not using the rep
> stosb generated by gcc builtin but plain movq's. Probably the cost model
> for __builtin_memset in gcc is wrong?

Yes, I believe so.

> When Jesper is free we wanted to benchmark this and maybe come up with a
> arch specific way of cleaning if it turns out to really improve throughput.
>
> SIMD instructions seem even faster but the kernel_fpu_begin/end() kill
> all the benefits.

One strange thing was that on my Skylake CPU (i7-6700K @4.00GHz),
Hannes's hand-optimized MOVQ ASM code didn't go past 8 bytes per cycle,
or 32 cycles for 256 bytes.
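
For illustration only, here is a minimal userspace sketch of the "plain
8-byte stores" idea in portable C.  It is not the code from
time_bench_memset.c; the names clear_qwords(), all_zero() and
CLEAR_BYTES are hypothetical.  A compiler is free to turn this loop back
into a memset call or into SIMD stores, so it only shows the store
pattern and says nothing about the cycle counts quoted here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CLEAR_BYTES 256	/* assumption: 4 cache-lines of 64 bytes */

/*
 * Hypothetical helper: clear 'len' bytes using plain 8-byte stores,
 * unrolled four times (32 bytes per loop iteration).
 * 'len' must be a multiple of 32.
 */
static void clear_qwords(uint64_t *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len / sizeof(uint64_t); i += 4) {
		buf[i + 0] = 0;
		buf[i + 1] = 0;
		buf[i + 2] = 0;
		buf[i + 3] = 0;
	}
}

/* Hypothetical helper: return 1 if all 'len' bytes are zero, else 0. */
static int all_zero(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	size_t i;

	for (i = 0; i < len; i++)
		if (p[i] != 0)
			return 0;
	return 1;
}
```

A caller would fill a CLEAR_BYTES-sized uint64_t array with a non-zero
pattern, call clear_qwords() on it, and check the result with
all_zero().  At 8 bytes per store and one store per cycle, 256 bytes
takes 32 stores, which matches the 32 cycles mentioned above.
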
Talking to Alex and John during netdev, and reading up on the Intel
arch, I thought that this CPU should be able to perform 16 bytes per
cycle.  The CPU can do it, as rep stos shows once the size gets large
enough.

On this CPU the memset rep stos starts to win around 512 bytes:

  192/35  =  5.5 bytes/cycle
  256/36  =  7.1 bytes/cycle
  512/40  = 12.8 bytes/cycle
  768/46  = 16.7 bytes/cycle
 1024/52  = 19.7 bytes/cycle
 2048/84  = 24.4 bytes/cycle
 4096/148 = 27.7 bytes/cycle

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer