* Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Mon, 17 Nov 2008, Eric Dumazet wrote:
>
> > Ingo Molnar wrote:
> >
> > > it gives a small speedup of ~1% on my box:
> > >
> > >    before:   Throughput 3437.65 MB/sec 64 procs
> > >    after:    Throughput 3473.99 MB/sec 64 procs
> >
> > Strange, I get 2350 MB/sec on my 8 cpus box. "tbench 8"
>
> I think Ingo may have a Nehalem. Let's just say that those things
> rock, and have rather good memory throughput.

hm, i'm not sure whether i can post benchmarks from the Nehalem box -
but i can confirm it in general terms that it's rather nice ;-)

This was run on another testbox (4x4 Barcelona) that rocks similarly
well in terms of memory subsystem latencies, which seems to be
tbench's main current critical path.

For the tbench bragging rights i'd probably turn off CONFIG_SECURITY
and a few other options. Plus i'd run with 16 threads only - in this
test i ran with 4x overload (64 tbench threads, not 16) to stress the
scheduler harder.

Although we degrade very gently with overload so the numbers aren't
all that much different:

     16 threads:   Throughput 3463.14 MB/sec   16 procs
     64 threads:   Throughput 3473.99 MB/sec   64 procs
    256 threads:   Throughput 3457.67 MB/sec  256 procs
   1024 threads:   Throughput 3448.85 MB/sec 1024 procs

   [ so it's the same within noise range. ]

1024 threads is already a massive 64x overload so beyond any
reasonable limit of workload sanity.

Which suggests that the main limitation factor is cacheline ping-pong
that is already in full effect at 16 threads.

Which is supported by the "most expensive instructions" top-10 sorted
list:

                 RIP          #hits
                 ..........................
[ usercopy ]
ffffffff80350fcd:   1373300   f3 48 a5            rep movsq %ds:(%rsi),%es:(%rdi)

ffffffff804a2f33: <sock_rfree>:
ffffffff804a2f34:    985253   48 89 e5            mov    %rsp,%rbp

ffffffff804d2eb7: <ip_local_deliver>:
ffffffff804d2eb8:    432659   48 89 e5            mov    %rsp,%rbp

ffffffff804aa23c: <constant_test_bit>:    [ => napi_disable_pending() ]
ffffffff804aa24c:    374052   89 d1               mov    %edx,%ecx

ffffffff804d5076: <ip_dont_fragment>:
ffffffff804d5076:    310051   8a 97 56 02 00 00   mov    0x256(%rdi),%dl

ffffffff804d9b17: <__inet_lookup_established>:
ffffffff804d9bdf:    247224   eb ba               jmp    ffffffff804d9b9b <__inet_lookup_established+0x84>

ffffffff80321529: <selinux_ip_postroute>:
ffffffff8032152a:    183700   48 89 e5            mov    %rsp,%rbp

ffffffff8020c020: <system_call>:
ffffffff8020c020:    183600   0f 01 f8            swapgs

ffffffff8051884a: <netlbl_enabled>:
ffffffff8051884a:    179538   55                  push   %rbp

The usual profiling caveat applies: it's not _these_ instructions that
matter, but the surrounding code that calls them. Profiling overhead
is delayed by a couple of instructions - the more out-of-order a CPU
is, the larger this delay can be. But even a quick look at the list
above shows that all of the heavy cache misses are generated by
networking.

Beyond the usual suspects of syscall entry and memcpy, it's only
networking. We don't even have the mov %cr3 TLB flush overhead in this
list, load_cr3() is a distant #30:

ffffffff8023049f:         0   0f 22 d8            mov    %rax,%cr3
ffffffff802304a2:    126303   c9                  leaveq

The place for the sock_rfree() hit looks a bit weird, and i'll
investigate it now a bit more to place the real overhead point
properly. (i already mapped the test-bit overhead: that comes from
napi_disable_pending())

The first entry is 10x the cost of the last entry in the list so
clearly we've got 1-2 brutal cacheline ping-pongs that dominate the
overhead of this workload.
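
[ The "same within noise range" reading of the thread-count table can
  be checked directly from the numbers quoted in this mail. What
  follows is only a minimal sketch in Python: the helper name
  relative_spread and the choice of (max - min) / mean as the spread
  measure are assumptions made for illustration, not something tbench
  reports. ]

```python
# Throughput figures (MB/sec) quoted in this mail, keyed by tbench thread count.
throughput = {16: 3463.14, 64: 3473.99, 256: 3457.67, 1024: 3448.85}

# The before/after pair quoted for the ~1% speedup (64 procs).
before, after = 3437.65, 3473.99


def relative_spread(values):
    """Return (max - min) / mean as a fraction (hypothetical helper)."""
    values = list(values)
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / mean


spread = relative_spread(throughput.values())
speedup = (after - before) / before

print(f"spread across 16..1024 threads: {spread:.2%}")
print(f"speedup from the patch:         {speedup:.2%}")
```

[ With these inputs the spread across thread counts comes out below
  the ~1% speedup, which fits the point that going from 16 to 1024
  threads changes throughput less than the patch does. ]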
	Ingo