Ingo Molnar wrote:
* David Miller <davem@xxxxxxxxxxxxx> wrote:
From: Ingo Molnar <mingo@xxxxxxx>
Date: Mon, 17 Nov 2008 10:06:48 +0100
* Rafael J. Wysocki <rjw@xxxxxxx> wrote:
This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.
The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27. Please verify if it still should
be listed and let me know (either way).
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11308
Subject : tbench regression on each kernel release from 2.6.22 -> 2.6.28
Submitter : Christoph Lameter <cl@xxxxxxxxxxxxxxxxxxxx>
Date : 2008-08-11 18:36 (98 days old)
References : http://marc.info/?l=linux-kernel&m=121847986119495&w=4
http://marc.info/?l=linux-kernel&m=122125737421332&w=4
Christoph, as per the recent analysis of Mike:
http://fixunix.com/kernel/556867-regression-benchmark-throughput-loss-a622cf6-f7160c7-pull.html
all scheduler components of this regression have been eliminated.
In fact his numbers show that scheduler speedups since 2.6.22 have
offset and hidden most other sources of tbench regression. (i.e. the
scheduler portion got 5% faster, hence it was able to offset a
slowdown of 5% in other areas of the kernel that tbench triggers)
Although I respect the improvements, wake_up() is still several
orders of magnitude slower than it was in 2.6.22 and wake_up() is at
the top of the profiles in tbench runs.
hm, several orders of magnitude slower? That contradicts Mike's
numbers and my own numbers and profiles as well: see below.
The scheduler's overhead barely even registers on a 16-way x86 system
i'm running tbench on. Here's the NMI profile during 64 threads tbench
on a 16-way x86 box with a v2.6.28-rc5 kernel [config attached]:
Throughput 3437.65 MB/sec 64 procs
==================================
21570252 total
........
1494803 copy_user_generic_string
998232 sock_rfree
491471 tcp_ack
482405 ip_dont_fragment
470685 ip_local_deliver
436325 constant_test_bit [ called by napi_disable_pending() ]
375469 avc_has_perm_noaudit
347663 tcp_sendmsg
310383 tcp_recvmsg
300412 __inet_lookup_established
294377 system_call
286603 tcp_transmit_skb
251782 selinux_ip_postroute
236028 tcp_current_mss
235631 schedule
234013 netif_rx
229854 _local_bh_enable_ip
219501 tcp_v4_rcv
[ etc. - see full profile attached further below ]
Note that the scheduler does not even show up in the profile up to
entry #15!
I've also summarized NMI profiler output by major subsystems:
NET overhead       (12603450/21570252): 58.43%
security overhead  ( 1903598/21570252):  8.83%
usercopy overhead  ( 1753617/21570252):  8.13%
sched overhead     ( 1599406/21570252):  7.41%
syscall overhead   (  560487/21570252):  2.60%
IRQ overhead       (  555439/21570252):  2.58%
slab overhead      (  492421/21570252):  2.28%
timer overhead     (  226573/21570252):  1.05%
pagealloc overhead (  192681/21570252):  0.89%
PID overhead       (  115123/21570252):  0.53%
VFS overhead       (  107926/21570252):  0.50%
pagecache overhead (   62552/21570252):  0.29%
gtod overhead      (   38651/21570252):  0.18%
IDLE overhead      (       0/21570252):  0.00%
----------------------------------------------
left               ( 1349494/21570252):  6.26%
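For readers who want to re-derive the percentages in the summary above, here is a minimal Python sketch. The sample counts are copied from the table; the names SUBSYSTEM_SAMPLES and share_percent are made up for this illustration and are not part of any profiling tool.

```python
# Re-derive the per-subsystem shares from the NMI sample counts quoted above.
# SUBSYSTEM_SAMPLES and share_percent are illustrative names, not a real tool's API.

TOTAL_SAMPLES = 21570252  # "total" line of the profile

SUBSYSTEM_SAMPLES = {
    "NET": 12603450,
    "security": 1903598,
    "usercopy": 1753617,
    "sched": 1599406,
    "syscall": 560487,
    "IRQ": 555439,
    "slab": 492421,
    "timer": 226573,
    "pagealloc": 192681,
    "PID": 115123,
    "VFS": 107926,
    "pagecache": 62552,
    "gtod": 38651,
    "IDLE": 0,
    "left": 1349494,
}


def share_percent(samples, total=TOTAL_SAMPLES):
    """Return the share of `samples` in `total`, as a percentage."""
    return 100.0 * samples / total


if __name__ == "__main__":
    for name, samples in SUBSYSTEM_SAMPLES.items():
        print(f"{name:<10} {samples:>9} {share_percent(samples):6.2f}%")
```

Running it prints one line per subsystem with the share rounded to two decimals, which can be compared against the table by eye.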
The scheduler's functions are absolutely flat, and consistent with an
extreme context-switching rate of 1.35 million per second. The
scheduler can go up to about 20 million context switches per second on
this system:
procs -----------memory------------ ---swap-- -----io---- ----system----- -----cpu------
 r  b   swpd     free   buff  cache   si   so    bi    bo     in       cs us sy id wa st
32  0      0 32229696  29308 649880    0    0     0     0 164135 20026853 24 76  0  0  0
32  0      0 32229752  29308 649880    0    0     0     0 164203 20032770 24 76  0  0  0
32  0      0 32229752  29308 649880    0    0     0     0 164201 20036492 25 75  0  0  0
... and 7% scheduling overhead is roughly consistent with 1.35/20.0.
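The back-of-the-envelope check above can be written out as a short Python sketch. It assumes, as the estimate itself does, that scheduler cost scales linearly with the context-switch rate and that the roughly 20 million switches per second in the vmstat run represent a fully loaded machine. The variable names are made up for this illustration.

```python
# Rough consistency check: context-switch rate versus profiled scheduler share.
# Assumption (from the estimate above): cost scales linearly with switch rate.

tbench_switches_per_sec = 1.35e6   # rate seen during the tbench run
max_switches_per_sec = 20.0e6      # approximate ceiling from the vmstat output

# Expected scheduler share if the cost per switch is constant.
expected_share = tbench_switches_per_sec / max_switches_per_sec

# Measured scheduler share from the NMI profile summary.
measured_share = 1599406 / 21570252

if __name__ == "__main__":
    print(f"expected: {100 * expected_share:.2f}%")
    print(f"measured: {100 * measured_share:.2f}%")
```

The two values differ by well under one percentage point, which is the sense in which the 7% figure is "roughly consistent" with 1.35/20.0.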
Wakeup affinities and data flow caching are just fine in this workload
- we've got scheduler statistics for that and they look good too.
It all looks like pure old-fashioned straight overhead in the
networking layer to me. Do we still touch the same global cacheline
for every localhost packet we process? Anything like that would show
up big time.
Yes, we do. I find it strange that we don't see dst_release() in your NMI profile.
I posted a patch (commit 5635c10d976716ef47ae441998aeae144c7e7387,
"net: make sure struct dst_entry refcount is aligned on 64 bytes",
in the net-next-2.6 tree) to properly align the struct dst_entry refcounter,
and got a 4% speedup on tbench on my machine.
There are small speedups too with commit ef711cf1d156428d4c2911b8c86c6ce90519dc45
("net: speedup dst_release()").
Also in net-next-2.6, there are patches that avoid dirtying last_rx on netdevices
(loopback, for example); they help tbench a lot too.
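To illustrate why aligning a hot refcount on a 64-byte boundary can matter, here is a small Python sketch that uses ctypes only to compute field offsets. The structure DstEntrySketch and all of its field names are hypothetical and do not reproduce the real struct dst_entry layout or the patch itself; the 64-byte cache line size is an assumption taken from the commit title.

```python
import ctypes

CACHE_LINE = 64  # assumed cache line size in bytes


class DstEntrySketch(ctypes.Structure):
    """Hypothetical layout: read-mostly fields, padding, then the hot refcount."""
    _fields_ = [
        ("read_mostly", ctypes.c_uint64 * 6),  # 48 bytes of rarely written data
        ("pad", ctypes.c_char * 16),           # padding up to the 64-byte boundary
        ("refcnt", ctypes.c_int32),            # written on every hold/release
        ("use", ctypes.c_int32),
        ("lastuse", ctypes.c_uint64),
    ]


def cache_line_of(offset):
    """Index of the cache line a byte offset falls into (struct assumed line-aligned)."""
    return offset // CACHE_LINE


if __name__ == "__main__":
    for name in ("read_mostly", "refcnt", "use", "lastuse"):
        off = getattr(DstEntrySketch, name).offset
        print(f"{name:<12} offset {off:3d} cache line {cache_line_of(off)}")
```

With this layout the frequently written counter starts its own cache line, so CPUs that only read the leading fields do not have that line invalidated on every refcount update. In kernel C the same effect would be achieved with explicit padding or an alignment attribute rather than ctypes.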