Re: Locks and the FSB

Sorry for the late response.

> The interrupts apply only to the receive side. Any CPU may put data into the
> qdisc, and I think any CPU may take data off the qdisc and send it. Have you
> set the process affinity so that the sending process runs on the same CPU the
> interrupt is raised on?
Not in this case. I have 4 web server processes running, each pinned to a different CPU. The interrupts are also pinned, such that interrupts for a specific NIC occur only on the CPU handling the data relevant to that web server process.
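For concreteness, here is a minimal sketch of one way such pinning can be done from inside the process itself, using sched_setaffinity() (the helper name and the CPU argument are purely illustrative, this is not the exact code I use). The interrupt side is pinned separately by writing a CPU mask to /proc/irq/<irq>/smp_affinity.

	/* Illustrative sketch: pin the calling process to a single CPU. */
	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>
	#include <stdlib.h>

	static void pin_to_cpu(int cpu)
	{
		cpu_set_t mask;

		CPU_ZERO(&mask);
		CPU_SET(cpu, &mask);	/* allow this process to run only on 'cpu' */

		if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
			perror("sched_setaffinity");
			exit(1);
		}
	}

	int main(int argc, char **argv)
	{
		int cpu = argc > 1 ? atoi(argv[1]) : 0;

		pin_to_cpu(cpu);
		printf("pinned to CPU %d\n", cpu);
		/* ... the web server's work loop would run here ... */
		return 0;
	}

Each of the four server processes would do the equivalent of pin_to_cpu() with its own CPU number before entering its work loop, matching the NIC interrupt pinned to that same CPU.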

> A cache line may contain more than a single spinlock. If any other data in
> the same cache line is accessed or modified, the cache line will not be in
> modified state.
Yes, this may be a case of false sharing; I'll check that.
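To make the concern concrete, here is a rough sketch of the kind of layout that would cause it, with made-up field names (this is not the actual e1000 structure): if the tx lock shares a cache line with counters that other CPUs write on every packet, the line bounces between CPUs even when the lock itself is never contended. Padding the lock onto its own cache line avoids that:

	#include <linux/spinlock.h>
	#include <linux/cache.h>

	/* hypothetical layout, only to illustrate the false-sharing point */
	struct hypothetical_adapter_stats {
		unsigned long	tx_packets;	/* bumped by the sending CPU   */
		unsigned long	rx_packets;	/* bumped by the receiving CPU */

		/* give the lock its own cache line so the counters above
		 * cannot drag it out of the owning CPU's cache */
		spinlock_t	tx_lock ____cacheline_aligned_in_smp;
	};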

> This and the lockstat output seem to tell us that there is no lock contention.
> Can you send some oprofile data?
Here are the top functions with respect to global_power_events in the 3-processor case (one samples/% column pair per processor):

samples  %        samples  %        samples  %        samples  %        app name   symbol name
141924    9.3228  182254   11.5263  178498   11.2869  621626   38.7608  vmlinux    cpu_idle
138963    9.1283  175842   11.1208  172319   10.8962  588353   36.6861  vmlinux    poll_idle
 79616    5.2299  103271    6.5312  101016    6.3875  338251   21.0913  vmlinux    __rcu_process_callbacks
 43093    2.8307   42444    2.6843   42888    2.7119       0    0       e1000.ko   e1000_xmit_frame
 42270    2.7767   42223    2.6703   43035    2.7212       0    0       vmlinux    __inet_check_established
 39980    2.6262   40492    2.5608   39348    2.4881       2    1.2e-04 e1000.ko   e1000_clean_rx_irq
 36378    2.3896   35166    2.2240   36595    2.3140       0    0       vmlinux    tcp_ack

and the same for the 4-processor case:

samples  %       samples  %       samples  %       samples  %       image name  app name  symbol name
64036    4.1148  65463    4.0258  63911    3.9295  63304    3.8920  e1000.ko    e1000.ko  e1000_clean_rx_irq
62808    4.0359  64516    3.9675  64566    3.9698  63963    3.9325  vmlinux     vmlinux   __inet_check_established
61813    3.9720  62271    3.8295  61574    3.7858  61609    3.7878  e1000.ko    e1000.ko  e1000_xmit_frame
47354    3.0429  49379    3.0366  49729    3.0575  49858    3.0653  vmlinux     vmlinux   tcp_ack
44598    2.8658  43340    2.6653  43713    2.6876  43097    2.6496  vmlinux     vmlinux   sys_epoll_ctl
32774    2.1060  35096    2.1583  32988    2.0282  35085    2.1571  vmlinux     vmlinux   do_sendfile
30658    1.9700  31268    1.9229  31685    1.9481  31291    1.9238  vmlinux     vmlinux   tcp_transmit_skb

As you can see, there is considerable per-processor idle time in the 3-processor case, which almost disappears when moving to 4 processors. Taking e1000_xmit_frame as an example, its average time share per processor rises from ~2.7% to ~3.8%.

The following is the most interesting part of this function's annotated assembly code:

    26  0.0603    31  0.0730    22  0.0513     0       0       :    53ea:       call   ff0 <e1000_maybe_stop_tx>
   117  0.2715   130  0.3063   116  0.2705     0       0       :    53ef:       mov    0x1c(%esp),%eax
   311  0.7217   260  0.6126   191  0.4453     0       0       :    53f3:       mov    0x34(%esp),%edx
     0       0     4  0.0094     7  0.0163     0       0       :    53f7:       call   53f8 <e1000_xmit_frame+0x808>
 19227 44.6175 19109 45.0217 19229 44.8354     0       0       :    53fc:       xor    %eax,%eax
   175  0.4061   169  0.3982   171  0.3987     0       0       :    53fe:       add    $0x7c,%esp
    61  0.1416    44  0.1037    48  0.1119     0       0       :    5401:       pop    %ebx

The instruction following the call to spin_unlock_irqrestore() accounts for 19227 sampled events out of a total of 43093 for the entire function (I'm only citing the numbers for the first processor, but the rest are roughly the same). When using all 4 processors, we end up with the following:

    33  0.0534    29  0.0466    32  0.0520    39  0.0633       :    53ea:       call   ff0 <e1000_maybe_stop_tx>
   182  0.2944   187  0.3003   207  0.3362   155  0.2516       :    53ef:       mov    0x1c(%esp),%eax
   440  0.7118   521  0.8367   338  0.5489   416  0.6752       :    53f3:       mov    0x34(%esp),%edx
     5  0.0081     6  0.0096     3  0.0049     1  0.0016       :    53f7:       call   53f8 <e1000_xmit_frame+0x808>
 35568 57.5413 35241 56.5930 34853 56.6034 35700 57.9461       :    53fc:       xor    %eax,%eax
   187  0.3025   209  0.3356   207  0.3362   206  0.3344       :    53fe:       add    $0x7c,%esp
    63  0.1019    69  0.1108    53  0.0861    68  0.1104       :    5401:       pop    %ebx

A dramatic increase, to 35568 events out of a total of 61813. Now, the number of per-processor L2 cache misses for this function is exactly the same in the 3- and 4-processor scenarios. On the other hand, the number of FSB read/write events grows from 2029 to 2978, of which more than half are attributed to the xor instruction that follows the call to spin_unlock_irqrestore().
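For context, the code around that unlock presumably looks something like the sketch below (names are made up, this is not the actual e1000 source, just the usual spin_lock_irqsave()/spin_unlock_irqrestore() pattern in a transmit routine). The xor %eax,%eax that absorbs the samples is most likely the "return 0" (NETDEV_TX_OK) that immediately follows the unlock:

	#include <linux/netdevice.h>
	#include <linux/skbuff.h>
	#include <linux/spinlock.h>

	/* illustrative private structure, not the real e1000_adapter */
	struct hypothetical_adapter {
		spinlock_t	tx_lock;
	};

	static int hypothetical_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
	{
		struct hypothetical_adapter *adapter = netdev_priv(netdev);
		unsigned long flags;

		spin_lock_irqsave(&adapter->tx_lock, flags);

		/* ... map the skb, fill tx descriptors, kick the hardware ... */

		spin_unlock_irqrestore(&adapter->tx_lock, flags);

		return 0;	/* likely compiled to the "xor %eax,%eax" above */
	}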

--Elad


