Re: Locks and the FSB

Sorry for the late response.

> The interrupts apply only to the receive side. Any CPU may put data into the
> qdisc, and I think any CPU may take data off the qdisc and send it. Have you
> set the process affinity so that the sending process runs on the same CPU the
> interrupt is raised on?
Not in this case. I have 4 web server processes running, each pinned to a different CPU. The interrupts are also pinned, such that interrupts for a specific NIC occur only on the CPU handling the data relevant to that web server process.
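For concreteness, here is a minimal sketch of one way such pinning can be done from inside the process itself, using sched_setaffinity() (the helper name and the CPU argument are purely illustrative, this is not the exact code I use). The interrupt side is pinned separately by writing a CPU mask to /proc/irq/<irq>/smp_affinity.

	/* Illustrative sketch: pin the calling process to a single CPU. */
	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>
	#include <stdlib.h>

	static void pin_to_cpu(int cpu)
	{
		cpu_set_t mask;

		CPU_ZERO(&mask);
		CPU_SET(cpu, &mask);	/* allow this process to run only on 'cpu' */

		if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
			perror("sched_setaffinity");
			exit(1);
		}
	}

	int main(int argc, char **argv)
	{
		int cpu = argc > 1 ? atoi(argv[1]) : 0;

		pin_to_cpu(cpu);
		printf("pinned to CPU %d\n", cpu);
		/* ... the web server's work loop would run here ... */
		return 0;
	}

Each of the four server processes would do the equivalent of pin_to_cpu() with its own CPU number before entering its work loop, matching the NIC interrupt pinned to that same CPU.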

> A cache line may contain more than a single spinlock. If any other data in
> the same cache line is accessed or modified, the cache line will not be in
> modified state.
Yes, this may be a case of false sharing; I'll check that.
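To make the concern concrete, here is a rough sketch of the kind of layout that would cause it, with made-up field names (this is not the actual e1000 structure): if the tx lock shares a cache line with counters that other CPUs write on every packet, the line bounces between CPUs even when the lock itself is never contended. Padding the lock onto its own cache line avoids that:

	#include <linux/spinlock.h>
	#include <linux/cache.h>

	/* hypothetical layout, only to illustrate the false-sharing point */
	struct hypothetical_adapter_stats {
		unsigned long	tx_packets;	/* bumped by the sending CPU   */
		unsigned long	rx_packets;	/* bumped by the receiving CPU */

		/* give the lock its own cache line so the counters above
		 * cannot drag it out of the owning CPU's cache */
		spinlock_t	tx_lock ____cacheline_aligned_in_smp;
	};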

> This and the lockstat output seem to tell us that there is no lock contention.
> Can you send some oprofile data?
Here are the top functions with respect to global_power_events in the 3-processor case (one samples/% column pair per processor):

samples  %        samples  %        samples  %        samples  %        app name   symbol name
141924    9.3228  182254   11.5263  178498   11.2869  621626   38.7608  vmlinux    cpu_idle
138963    9.1283  175842   11.1208  172319   10.8962  588353   36.6861  vmlinux    poll_idle
 79616    5.2299  103271    6.5312  101016    6.3875  338251   21.0913  vmlinux    __rcu_process_callbacks
 43093    2.8307   42444    2.6843   42888    2.7119       0    0       e1000.ko   e1000_xmit_frame
 42270    2.7767   42223    2.6703   43035    2.7212       0    0       vmlinux    __inet_check_established
 39980    2.6262   40492    2.5608   39348    2.4881       2    1.2e-04 e1000.ko   e1000_clean_rx_irq
 36378    2.3896   35166    2.2240   36595    2.3140       0    0       vmlinux    tcp_ack

and the same for the 4-processor case:

samples  %       samples  %       samples  %       samples  %       image name  app name  symbol name
64036    4.1148  65463    4.0258  63911    3.9295  63304    3.8920  e1000.ko    e1000.ko  e1000_clean_rx_irq
62808    4.0359  64516    3.9675  64566    3.9698  63963    3.9325  vmlinux     vmlinux   __inet_check_established
61813    3.9720  62271    3.8295  61574    3.7858  61609    3.7878  e1000.ko    e1000.ko  e1000_xmit_frame
47354    3.0429  49379    3.0366  49729    3.0575  49858    3.0653  vmlinux     vmlinux   tcp_ack
44598    2.8658  43340    2.6653  43713    2.6876  43097    2.6496  vmlinux     vmlinux   sys_epoll_ctl
32774    2.1060  35096    2.1583  32988    2.0282  35085    2.1571  vmlinux     vmlinux   do_sendfile
30658    1.9700  31268    1.9229  31685    1.9481  31291    1.9238  vmlinux     vmlinux   tcp_transmit_skb

As you can see, there is considerable per-processor idle time in the 3-processor case, which almost disappears when moving to 4 processors. Taking e1000_xmit_frame as an example, its average time share per processor rises from ~2.7% to ~3.8%.

The following is the most interesting part of this function's annotated assembly code:

    26  0.0603    31  0.0730    22  0.0513     0       0       :    53ea:       call   ff0 <e1000_maybe_stop_tx>
   117  0.2715   130  0.3063   116  0.2705     0       0       :    53ef:       mov    0x1c(%esp),%eax
   311  0.7217   260  0.6126   191  0.4453     0       0       :    53f3:       mov    0x34(%esp),%edx
     0       0     4  0.0094     7  0.0163     0       0       :    53f7:       call   53f8 <e1000_xmit_frame+0x808>
 19227 44.6175 19109 45.0217 19229 44.8354     0       0       :    53fc:       xor    %eax,%eax
   175  0.4061   169  0.3982   171  0.3987     0       0       :    53fe:       add    $0x7c,%esp
    61  0.1416    44  0.1037    48  0.1119     0       0       :    5401:       pop    %ebx

The instruction following the call to spin_unlock_irqrestore() accounts for 19227 sampled events out of a total of 43093 for the entire function (I'm only citing the numbers for the first processor, but the rest are roughly the same). When using all 4 processors, we end up with the following:

    33  0.0534    29  0.0466    32  0.0520    39  0.0633       :    53ea:       call   ff0 <e1000_maybe_stop_tx>
   182  0.2944   187  0.3003   207  0.3362   155  0.2516       :    53ef:       mov    0x1c(%esp),%eax
   440  0.7118   521  0.8367   338  0.5489   416  0.6752       :    53f3:       mov    0x34(%esp),%edx
     5  0.0081     6  0.0096     3  0.0049     1  0.0016       :    53f7:       call   53f8 <e1000_xmit_frame+0x808>
 35568 57.5413 35241 56.5930 34853 56.6034 35700 57.9461       :    53fc:       xor    %eax,%eax
   187  0.3025   209  0.3356   207  0.3362   206  0.3344       :    53fe:       add    $0x7c,%esp
    63  0.1019    69  0.1108    53  0.0861    68  0.1104       :    5401:       pop    %ebx

A dramatic increase, to 35568 events out of a total of 61813. Now, the number of per-processor L2 cache misses for this function is exactly the same in the 3- and 4-processor scenarios. On the other hand, the number of FSB read/write events grows from 2029 to 2978, of which more than half are attributed to the xor instruction that follows the call to spin_unlock_irqrestore().
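For context, the code around that unlock presumably looks something like the sketch below (names are made up, this is not the actual e1000 source, just the usual spin_lock_irqsave()/spin_unlock_irqrestore() pattern in a transmit routine). The xor %eax,%eax that absorbs the samples is most likely the "return 0" (NETDEV_TX_OK) that immediately follows the unlock:

	#include <linux/netdevice.h>
	#include <linux/skbuff.h>
	#include <linux/spinlock.h>

	/* illustrative private structure, not the real e1000_adapter */
	struct hypothetical_adapter {
		spinlock_t	tx_lock;
	};

	static int hypothetical_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
	{
		struct hypothetical_adapter *adapter = netdev_priv(netdev);
		unsigned long flags;

		spin_lock_irqsave(&adapter->tx_lock, flags);

		/* ... map the skb, fill tx descriptors, kick the hardware ... */

		spin_unlock_irqrestore(&adapter->tx_lock, flags);

		return 0;	/* likely compiled to the "xor %eax,%eax" above */
	}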

--Elad


