Sorry for the late response.
> The interrupts apply only to the receive side. Any CPU may put data into
> the qdisc and I think any CPU may take data off the qdisc and send it.
> Have you set the process affinity so that the sending process runs on the
> same CPU the interrupt is raised on?
Not in this case. I have 4 web server processes running, each pinned to a different CPU.
The interrupts are also pinned, so that interrupts for a specific NIC occur only on the
CPU that runs the web server process handling that NIC's traffic.
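For the record, the pinning is done along these lines (a minimal sketch; pin_to_cpu()
and the CPU numbering are illustrative, and the matching interrupt routing is done
separately by writing a CPU mask to /proc/irq/<n>/smp_affinity):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Pin the calling process to a single CPU -- the same CPU that
 * services this worker's NIC interrupt. */
static void pin_to_cpu(int cpu)
{
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
                perror("sched_setaffinity");
                exit(1);
        }
}

Each of the 4 workers calls this with its own CPU index at startup.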
> A cache line may contain more than a single spinlock. If any other data in
> the same cache line is accessed or modified, the cache line will not stay
> in the modified state.
Yes, this may be a case of false sharing; I'll check that.
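If it does turn out to be false sharing, I understand the usual fix is to give the
hot lock its own cache line, e.g. (a sketch -- tx_path_state and its fields are
made-up names; the annotation is the kernel's ____cacheline_aligned_in_smp from
linux/cache.h):

#include <linux/cache.h>
#include <linux/spinlock.h>

/* Hypothetical layout: the lock starts on its own cache line, and the
 * hot counters start on the next one, so a CPU updating the counters
 * no longer steals the lock's line from a CPU spinning on it. */
struct tx_path_state {
        spinlock_t tx_lock ____cacheline_aligned_in_smp;
        unsigned long tx_packets ____cacheline_aligned_in_smp;
        unsigned long tx_bytes;
};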
> This and the lockstat output seem to indicate that there is no lock
> contention. Can you send some oprofile data?
Here are the top functions with respect to global_power_events in the 3-processor case
(one samples/% column pair per CPU; the fourth CPU is left idle):
samples  %        samples  %        samples  %        samples  %        app name  symbol name
141924   9.3228   182254   11.5263  178498   11.2869  621626   38.7608  vmlinux   cpu_idle
138963   9.1283   175842   11.1208  172319   10.8962  588353   36.6861  vmlinux   poll_idle
 79616   5.2299   103271   6.5312   101016   6.3875   338251   21.0913  vmlinux   __rcu_process_callbacks
 43093   2.8307    42444   2.6843    42888   2.7119        0   0        e1000.ko  e1000_xmit_frame
 42270   2.7767    42223   2.6703    43035   2.7212        0   0        vmlinux   __inet_check_established
 39980   2.6262    40492   2.5608    39348   2.4881        2   1.2e-04  e1000.ko  e1000_clean_rx_irq
 36378   2.3896    35166   2.2240    36595   2.3140        0   0        vmlinux   tcp_ack
and the same for the 4-processor case:
samples  %        samples  %        samples  %        samples  %        image name  app name  symbol name
 64036   4.1148    65463   4.0258    63911   3.9295    63304   3.8920   e1000.ko    e1000.ko  e1000_clean_rx_irq
 62808   4.0359    64516   3.9675    64566   3.9698    63963   3.9325   vmlinux     vmlinux   __inet_check_established
 61813   3.9720    62271   3.8295    61574   3.7858    61609   3.7878   e1000.ko    e1000.ko  e1000_xmit_frame
 47354   3.0429    49379   3.0366    49729   3.0575    49858   3.0653   vmlinux     vmlinux   tcp_ack
 44598   2.8658    43340   2.6653    43713   2.6876    43097   2.6496   vmlinux     vmlinux   sys_epoll_ctl
 32774   2.1060    35096   2.1583    32988   2.0282    35085   2.1571   vmlinux     vmlinux   do_sendfile
 30658   1.9700    31268   1.9229    31685   1.9481    31291   1.9238   vmlinux     vmlinux   tcp_transmit_skb
As you can see, there is considerable per-processor idle time in the 3-processor case,
which almost disappears when moving to 4 processors. Taking e1000_xmit_frame as an
example, its average time share rises from ~2.7% per processor to ~3.8%.
The following is the most interesting part of this function's annotated assembly code:
   26  0.0603     31  0.0730     22  0.0513      0  0        : 53ea: call  ff0 <e1000_maybe_stop_tx>
  117  0.2715    130  0.3063    116  0.2705      0  0        : 53ef: mov   0x1c(%esp),%eax
  311  0.7217    260  0.6126    191  0.4453      0  0        : 53f3: mov   0x34(%esp),%edx
    0  0           4  0.0094      7  0.0163      0  0        : 53f7: call  53f8 <e1000_xmit_frame+0x808>
19227  44.6175 19109  45.0217 19229  44.8354     0  0        : 53fc: xor   %eax,%eax
  175  0.4061    169  0.3982    171  0.3987      0  0        : 53fe: add   $0x7c,%esp
   61  0.1416     44  0.1037     48  0.1119      0  0        : 5401: pop   %ebx
The instruction following a call to spin_unlock_irqrestore() accounts for 19227 sampled
events out of a total of 43093 for the entire function (I'm only citing the numbers for
the first processor, but the rest are roughly the same). When using all 4 processors, we
end up with the following:
   33  0.0534     29  0.0466     32  0.0520     39  0.0633   : 53ea: call  ff0 <e1000_maybe_stop_tx>
  182  0.2944    187  0.3003    207  0.3362    155  0.2516   : 53ef: mov   0x1c(%esp),%eax
  440  0.7118    521  0.8367    338  0.5489    416  0.6752   : 53f3: mov   0x34(%esp),%edx
    5  0.0081      6  0.0096      3  0.0049      1  0.0016   : 53f7: call  53f8 <e1000_xmit_frame+0x808>
35568  57.5413 35241  56.5930 34853  56.6034 35700  57.9461  : 53fc: xor   %eax,%eax
  187  0.3025    209  0.3356    207  0.3362    206  0.3344   : 53fe: add   $0x7c,%esp
   63  0.1019     69  0.1108     53  0.0861     68  0.1104   : 5401: pop   %ebx
That is a dramatic increase, to 35568 events out of a total of 61813. Now, the number of
per-processor L2 cache misses for this function is exactly the same in the 3- and
4-processor scenarios. The number of FSB read/write events, on the other hand, grows from
2029 to 2978, of which more than half are attributed to the xor instruction that follows
the call to spin_unlock_irqrestore().
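For context, the hot spot sits in a pattern roughly like this (a paraphrased sketch of
the transmit path's locking, not the verbatim e1000 source):

static int e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
{
        struct e1000_adapter *adapter = netdev_priv(netdev);
        unsigned long flags;

        spin_lock_irqsave(&adapter->tx_lock, flags);
        /* ... map the skb and queue it onto the TX ring ... */
        spin_unlock_irqrestore(&adapter->tx_lock, flags);
        /* The xor at 53fc is the first instruction after this call. */
        return 0;
}

One caveat I'm aware of: if the profiler samples through maskable interrupts rather than
NMIs, no sample can land inside the interrupts-off critical section, so all of those
events get attributed to the first instruction executed after interrupts come back on --
which is exactly the xor above. Either way, the growing FSB traffic would be consistent
with the tx_lock cache line bouncing between the processors.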
--Elad