Re: md raid performance with 3-18-rc3

This time with the attachment:

manish
On 12/09/2014 01:54 PM, Manish Awasthi wrote:
resending:

dirty_ratio is the same for both kernels:

vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500
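
These can be read on either kernel with, for example:

# sysctl -a 2>/dev/null | grep '^vm.dirty'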


I re-ran the tests with the same set of kernels, without enabling multithread support on 3.18, and measured a few things with perf.
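
"Multithread support" here means the raid5 worker threads; on 3.18 they are
controlled per array through the group_thread_cnt sysfs entry. The path below
assumes the array is md125; 0 (the default) leaves the extra worker threads
disabled:

# cat /sys/block/md125/md/group_thread_cnt
0
# echo 4 > /sys/block/md125/md/group_thread_cnt   # would enable 4 worker threads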

perf-stat-<kernel>.txt: the test ran for some time while perf stat measured the counters below.

Meanwhile I'm also running the complete test under perf record. I'll share the results soon.
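
Roughly along these lines (the pid, duration, and output filename are just
placeholders):

# perf record -g -p 2778 -- sleep 600
# perf report --stdio > perf-report-3.18.txt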

Manish

On 12/03/2014 11:51 AM, NeilBrown wrote:
On Wed, 26 Nov 2014 13:41:39 +0530 Manish Awasthi
<manish.awasthi@xxxxxxxxxxxxxxxxxx>  wrote:

Whatever data I have on the comparison is attached; I have consolidated it
from the log files into Excel. See if this helps.
raid_3_18_performance.xls shows read throughput to be consistently 20% down
on 3.18 compared to 3.6.11.

Writes are a few percent better for 4G/8G files, 20% better for 16G/32G files, and unchanged above that.
Given that you have 8G of RAM, that seems like it could be some change in
caching behaviour, and not necessarily a change in RAID behaviour.
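
One way to take the page cache out of the comparison is to drop it between
runs, e.g.:

# sync; echo 3 > /proc/sys/vm/drop_caches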

The CPU utilization roughly follows the throughput: 40% higher when write
throughput is 20% better.
Could you check whether the value of /proc/sys/vm/dirty_ratio is the same for both tests? That number has changed occasionally and could affect these tests.


The second file, 3SSDs-perf-2-Cores-3.18-rc1, has the "change" numbers
negative where I expected positive, i.e. a negative number means an increase.

Writes consistently have higher CPU utilisation.
Reads consistently have much lower CPU utilization.

I don't know what that means ... it might not mean anything.

Could you please run the tests between the two kernels *without* RAID, i.e. directly on an SSD. That will give us a baseline for what changes are caused by other parts of the kernel (filesystem, block layer, MM, etc). Then we can
see how much change RAID5 is contributing.
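
As an illustration, a direct-I/O baseline against a single member SSD (the
device name and job parameters here are only placeholders, and should mirror
whatever the RAID tests used):

# fio --name=baseline --filename=/dev/sdb --direct=1 --ioengine=libaio \
      --rw=read --bs=1M --iodepth=32 --runtime=60 --time_based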

The third file, 3SSDs-perf-4Core.xls, seems to show significantly reduced
throughput across the board.
CPU utilization is less (better) for writes, but worse for reads. That is
the reverse of what the second file shows.

I might try running some tests across a set of kernel versions and see what I
can come up with.

NeilBrown



perf stat on md125_raid5 -- kernel 3.6.11

# perf stat -p 2613 -e cycles,instructions,cache-references,cache-misses,branches,branch-misses,bus-cycles,stalled-cycles-frontend,ref-cycles,cpu-clock,task-clock,faults,context-switches,cpu-migrations,minor-faults,major-faults,alignment-faults,emulation-faults,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses,LLC-loads,LLC-stores,LLC-prefetches,dTLB-load-misses,dTLB-store-misses,iTLB-loads,iTLB-load-misses,branch-loads,branch-load-misses
^C 
 Performance counter stats for process id '2613':

   103,200,677,721      cycles                    #    2.848 GHz                     [22.72%]
    69,669,813,983      instructions              #    0.68  insns per cycle        
                                                  #    1.07  stalled cycles per insn [27.26%]
     2,668,465,769      cache-references          #   73.648 M/sec                   [27.35%]
     1,408,493,680      cache-misses              #   52.783 % of all cache refs     [27.17%]
    13,609,211,321      branches                  #  375.607 M/sec                   [27.19%]
       121,593,598      branch-misses             #    0.89% of all branches         [27.32%]
     3,420,725,359      bus-cycles                #   94.410 M/sec                   [18.07%]
    74,362,368,252      stalled-cycles-frontend   #   72.06% frontend cycles idle    [18.16%]
   112,553,945,650      ref-cycles                # 3106.427 M/sec                   [22.76%]
      36233.766411      cpu-clock (msec)                                            
      36232.605499      task-clock (msec)         #    0.181 CPUs utilized          
                 0      faults                    #    0.000 K/sec                  
           442,885      context-switches          #    0.012 M/sec                  
             9,646      cpu-migrations            #    0.266 K/sec                  
                 0      minor-faults              #    0.000 K/sec                  
                 0      major-faults              #    0.000 K/sec                  
                 0      alignment-faults          #    0.000 K/sec                  
                 0      emulation-faults          #    0.000 K/sec                  
     3,188,865,936      L1-dcache-load-misses     #   88.011 M/sec                   [22.96%]
     1,658,831,957      L1-dcache-store-misses    #   45.783 M/sec                   [22.89%]
       338,744,029      L1-dcache-prefetch-misses #    9.349 M/sec                   [23.04%]
       445,066,995      L1-icache-load-misses     #   12.284 M/sec                   [22.99%]
     1,578,067,225      LLC-loads                 #   43.554 M/sec                   [18.19%]
     1,317,822,999      LLC-stores                #   36.371 M/sec                   [18.23%]
       798,004,610      LLC-prefetches            #   22.024 M/sec                   [ 9.09%]
                 0      dTLB-load-misses          #    0.000 K/sec                   [13.52%]
         7,633,236      dTLB-store-misses         #    0.211 M/sec                   [18.03%]
        10,024,464      iTLB-loads                #    0.277 M/sec                   [17.92%]
         3,157,141      iTLB-load-misses          #   31.49% of all iTLB cache hits  [18.12%]
    13,616,857,645      branch-loads              #  375.818 M/sec                   [18.16%]
       119,250,450      branch-load-misses        #    3.291 M/sec                   [18.14%]

     200.190181623 seconds time elapsed



perf stat on md125_raid5 -- kernel 3.18

# perf stat -p 2778 -e cycles,instructions,cache-references,cache-misses,branches,branch-misses,bus-cycles,stalled-cycles-frontend,ref-cycles,cpu-clock,task-clock,faults,context-switches,cpu-migrations,minor-faults,major-faults,alignment-faults,emulation-faults,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses,LLC-loads,LLC-stores,LLC-prefetches,dTLB-load-misses,dTLB-store-misses,iTLB-loads,iTLB-load-misses,branch-loads,branch-load-misses
^C
 Performance counter stats for process id '2778':

   191,212,778,981      cycles                    #    2.942 GHz                     [22.99%]
   160,318,628,367      instructions              #    0.84  insns per cycle        
                                                  #    0.77  stalled cycles per insn [27.49%]
     3,800,688,695      cache-references          #   58.485 M/sec                   [27.40%]
     1,418,431,693      cache-misses              #   37.320 % of all cache refs     [27.27%]
    33,635,552,951      branches                  #  517.586 M/sec                   [27.12%]
       352,264,516      branch-misses             #    1.05% of all branches         [27.19%]
     6,035,806,867      bus-cycles                #   92.879 M/sec                   [18.21%]
   122,980,401,285      stalled-cycles-frontend   #   64.32% frontend cycles idle    [18.16%]
   197,829,618,312      ref-cycles                # 3044.216 M/sec                   [22.72%]
      65039.738267      cpu-clock (msec)                                            
      64985.415568      task-clock (msec)         #    0.186 CPUs utilized          
                 0      faults                    #    0.000 K/sec                  
         3,437,945      context-switches          #    0.053 M/sec                  
               237      cpu-migrations            #    0.004 K/sec                  
                 0      minor-faults              #    0.000 K/sec                  
                 0      major-faults              #    0.000 K/sec                  
                 0      alignment-faults          #    0.000 K/sec                  
                 0      emulation-faults          #    0.000 K/sec                  
     5,329,711,939      L1-dcache-load-misses     #   82.014 M/sec                   [22.83%]
     2,138,400,107      L1-dcache-store-misses    #   32.906 M/sec                   [22.52%]
       667,646,968      L1-dcache-prefetch-misses #   10.274 M/sec                   [22.48%]
     2,259,425,830      L1-icache-load-misses     #   34.768 M/sec                   [22.45%]
     2,090,596,777      LLC-loads                 #   32.170 M/sec                   [17.93%]
     1,679,287,271      LLC-stores                #   25.841 M/sec                   [18.04%]
     1,120,086,147      LLC-prefetches            #   17.236 M/sec                   [ 9.09%]
       465,142,622      dTLB-load-misses          #    7.158 M/sec                   [13.69%]
        26,672,298      dTLB-store-misses         #    0.410 M/sec                   [18.26%]
        66,723,475      iTLB-loads                #    1.027 M/sec                   [18.37%]
         9,736,729      iTLB-load-misses          #   14.59% of all iTLB cache hits  [18.43%]
    33,238,082,664      branch-loads              #  511.470 M/sec                   [18.44%]
       346,025,993      branch-load-misses        #    5.325 M/sec                   [18.46%]

     348.946853958 seconds time elapsed



