Re: Growing RAID5 SSD Array

On 26/03/14 07:31, Stan Hoeppner wrote:
On 3/25/2014 8:10 AM, Adam Goryachev wrote:
I'll respond to the other email later on, but in between, I've found something else that seems just plain wrong.

So, right now, I've shut down most of the VMs (just one Linux VM left, which should be mostly idle since it is after 11pm local time). I'm trying to create a duplicate copy of one LV to another as a backup (in case I mess it up). I've also shut down DRBD, so we are operating independently (not that there is any change if DRBD is connected), and I'm running on the storage server itself (so no iSCSI or network issues).

So, two LV's:
   LV                                    VG   Attr     LSize   Pool Origin Data%  Move Log Copy%  Convert
   backup_xptserver1_d1_20140325_224311  vg0  -wi-ao-- 453.00g
   xptserver1_d1                         vg0  -wi-ao-- 452.00g
So you're copying 452 GB of raw bytes from one LV to another.

running the command:
This is part of the problem:
dd if=/dev/vg0/xptserver1_d1 of=/dev/vg0/backup_xptserver1_d1_20140325_224311
Using the dd defaults of buffered IO and a 512 byte block size is horribly inefficient when copying 452 GB of data, especially to SSD.  Buffered IO consumes 904 GB of extra memory bandwidth.  Using 512 byte IOs also demands much more work from the raid5 write thread and more stripe cache bandwidth.  Use this instead:

dd if=/dev/vg0/xxx of=/dev/vg0/yyy iflag=direct oflag=direct bs=1536k

This eliminates 904 GB of RAM b/w in memcpy's and writes out to the block layer in 1.5 MB IOs, i.e. four full stripes.  This decreases the amount of work required of md, as it receives 4 stripes of aligned IO at once instead of 512 byte IOs which it must assemble.
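
(For reference, the arithmetic behind bs=1536k, using the 64k chunk and 7-drive geometry shown in the mdadm output further down; a quick sanity check you can run in the shell:)

CHUNK_KB=64; DRIVES=7                      # from mdadm --detail /dev/md1
STRIPE_KB=$(( CHUNK_KB * (DRIVES - 1) ))   # 6 data chunks = 384 KiB full stripe
echo "bs=$(( 4 * STRIPE_KB ))k"            # prints bs=1536k, i.e. four full stripes
# The 904 GB figure is simply 2 x 452 GB: buffered IO copies every byte through
# the page cache once on the read side and once on the write side.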

Yes, of course, I should have known better! What a waste of three hours or so....
from another shell I run:
while pidof dd > /dev/null;do kill -USR1 `pidof dd`;sleep 10;done

dd shows this output:
99059692032 bytes (99 GB) copied, 2515.43 s, 39.4 MB/s
99403235840 bytes (99 GB) copied, 2525.45 s, 39.4 MB/s
99817538048 bytes (100 GB) copied, 2535.47 s, 39.4 MB/s
100252660224 bytes (100 GB) copied, 2545.49 s, 39.4 MB/s
Yes, that is very low, worse than a single spinning rust drive.  Using the dd options above should bump this up substantially.  However I have read claims that LVM2 over md tends to decrease performance.  I'm still looking into that for verification.

When you performed the in depth FIO testing last year with the job files I provided, was the target the md RAID device or an LV?

I'm certain that it was against an LV on DRBD on MD RAID5, while the DRBD was disconnected.

iostat -dmx 1 shows this output:

sda - sdg are the RAID5 SSD drives, single partition, used by md only
dm-8 is the source for the dd copy
dm-17 is the destination of the dd copy,
dm-12 is the Linux VM which is currently running...

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdg             957.00  6767.00  930.00  356.00     8.68    27.65 57.85     0.65    0.50    0.16    1.39   0.37  48.00
sdd             956.00  6774.00  921.00  313.00     8.69    27.50 60.06     0.26    0.21    0.08    0.60   0.17  20.80
sda             940.00  6781.00  927.00  326.00     8.65    27.57 59.20     0.28    0.22    0.09    0.60   0.17  20.80
sdf             967.00  6768.00  927.00  320.00     8.70    27.50 59.46     0.29    0.23    0.12    0.55   0.16  20.00
sde             943.00  6770.00  933.00  369.00     8.69    27.71 57.26     0.74    0.57    0.16    1.60   0.44  57.20
sdc             983.00  6790.00  937.00  317.00     8.86    27.55 59.46     1.58    1.27    0.71    2.90   0.49  61.60
sdb             966.00  6813.00  929.00  313.00     8.76    27.57 59.92     1.20    0.97    0.34    2.84   0.49  61.20
                   ^^^^^^^ ^^^^^^^
Note the difference between read merges and write merges, about 7:1, whereas the bandwidth is about 3:1.  That's about 7K read merges/s and 48K write merges/s.  Telling dd to use 1.5 MB IOs should reduce merges significantly, increasing throughput by a non-negligible amount.  It should also decrease %util substantially, as less CPU time is required for merging, and less for md to assemble stripes from tiny 512 byte writes.

md1               0.00     0.00 12037.00 42030.00    56.42 164.04     8.35     0.00    0.00    0.00    0.00   0.00   0.00
drbd2             0.00     0.00 12034.00 41989.00    56.41 164.02     8.36   177.73    3.31    0.46    4.13   0.02  91.60
dm-8              0.00     0.00 5955.00    0.00    23.26 0.00     8.00     4.43    0.74    0.74    0.00   0.01   6.40
dm-12             0.00     0.00  254.00    5.00    10.39     0.02 82.38     0.28    1.08    1.01    4.80   0.59  15.20
dm-17             0.00     0.00 5813.00 41984.00    22.71 164.00     8.00   174.87    3.65    0.15    4.13   0.02 100.00
...
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdg            1472.00     0.00 1681.00    0.00    14.70     0.00 17.90     0.14    0.08    0.08    0.00   0.08  13.60
sdd            1472.00     0.00 1668.00    0.00    14.64     0.00 17.98     0.12    0.07    0.07    0.00   0.07  11.20
sda            1472.00     0.00 1673.00    0.00    14.66     0.00 17.95     0.12    0.07    0.07    0.00   0.07  11.60
sdf            1472.00     0.00 1680.00    0.00    14.69     0.00 17.91     0.13    0.08    0.08    0.00   0.07  12.40
sde            1472.00     0.00 1685.00    0.00    14.71     0.00 17.88     0.12    0.07    0.07    0.00   0.07  11.60
sdc            1478.00     0.00 1687.00    0.00    14.72     0.00 17.87     0.12    0.07    0.07    0.00   0.07  11.20
sdb            1487.00     0.00 1679.00    0.00    14.69     0.00 17.92     0.14    0.08    0.08    0.00   0.08  13.20
md1               0.00     0.00 22182.00    0.00   103.29 0.00     9.54     0.00    0.00    0.00    0.00   0.00   0.00
drbd2             0.00     0.00 22244.00    0.00   103.66 0.00     9.54     5.76    0.26    0.26    0.00   0.03  59.60
dm-8              0.00     0.00 10945.00    0.00    42.75 0.00     8.00     5.74    0.50    0.50    0.00   0.00   4.00
dm-12             0.00     0.00  446.00    0.00    18.51     0.00 84.99     0.07    0.15    0.15    0.00   0.07   3.20
dm-17             0.00     0.00 10836.00    0.00    42.33 0.00     8.00     0.58    0.05    0.05    0.00   0.05  57.60
No clue here.  You're reading exactly the same amount from the drives, drbd2, dm-8, and dm-17.  Given your description of a dd copy from dm-8 to dm-17, it seems odd that nearly the same number of bytes are being read from both dm-8 and dm-17 here, with no writes at all.
I've just double checked: definitely reading from dm-8 and writing to dm-17. Since all the LVs sit on DRBD, the total reads on the LVs should equal the reads on drbd2, and the same goes for writes. Also, the values for drbd2 should (approximately) equal md1, and the sum of sd[a-g]. I really have no idea what would be reading from dm-17, or why...

Another 15 seconds of 0.00 wMB/s on dm-17
These periods of no write activity suggest that your iostat timing didn't fully coincide with your dd copy.  If it's not that, then something is causing your write IO to stall entirely.  Any stack traces in dmesg?

Definitely not, the stats were collected and the email sent hours before the dd completed.... I only collected the stats for 76 seconds, the copy took around 4 hours...

Very interesting... looking at log files can be, at times :)
So, no stack traces or anything like that related to this. However, just last night the log started recording errors on the OS drive (sdh); some testing with dd shows that it is returning unreadable (I/O) errors between 77551MB and 77555MB. The first command below works, the second fails:
dd if=/dev/sdh of=/dev/null bs=1M iflag=direct skip=77550 count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00356682 s, 294 MB/s

dd if=/dev/sdh of=/dev/null bs=1M iflag=direct skip=77551 count=1
dd: reading `/dev/sdh': Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.213353 s, 0.0 kB/s
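
(If it helps, a small sketch to map the unreadable span more precisely; it just repeats the probe above across a range of 1 MiB offsets, and the exact range is only an example:)

for off in $(seq 77545 77560); do
    if dd if=/dev/sdh of=/dev/null bs=1M iflag=direct skip=$off count=1 >/dev/null 2>&1; then
        echo "offset ${off}MB: OK"
    else
        echo "offset ${off}MB: read error"
    fi
done

smartctl -a /dev/sdh should also show whether the pending/reallocated sector counts are climbing.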

This drive is on a different SATA controller (onboard) while all the rest of the drives are on the LSI SATA controller. I can read from the drive fine before 77551 and after 77556. I've just ordered a replacement drive and will swap it in tonight, then wait for the warranty replacement later. FYI, it's an Intel 120GB SSD.

I can't be sure, but I don't think this should have impacted the copy, given that the OS drive isn't even in use generally, and wasn't the source or destination of the copy.

Actually, drive was replaced already....

In fact, the peak value is 180.00 and the minimum is 0.00 (wMB/s on dm-17), with a total of 44 seconds at 0.00, 16 seconds over 100.00, and 16 seconds between 0 and 100.

Here is a look at top -b -d 0.5 -n 60|grep ^\%Cpu
   95.9% --  %Cpu0  :  2.1 us, 29.2 sy,  0.0 ni,  4.2 id, 64.6 wa,  0.0 hi,  0.0 si,  0.0 st
   91.1% --  %Cpu0  :  0.0 us, 24.4 sy,  0.0 ni,  6.7 id, 66.7 wa,  0.0 hi,  2.2 si,  0.0 st
   82.9% --  %Cpu0  :  0.0 us, 25.5 sy,  0.0 ni, 14.9 id, 57.4 wa,  0.0 hi,  2.1 si,  0.0 st
   91.3% --  %Cpu0  :  2.2 us, 32.6 sy,  0.0 ni,  4.3 id, 56.5 wa,  0.0 hi,  4.3 si,  0.0 st
  100.0% --  %Cpu0  :  4.0 us, 42.0 sy,  0.0 ni,  0.0 id, 54.0 wa,  0.0 hi,  0.0 si,  0.0 st
  100.0% --  %Cpu0  :  2.2 us, 39.1 sy,  0.0 ni,  0.0 id, 58.7 wa,  0.0 hi,  0.0 si,  0.0 st
   93.5% --  %Cpu0  :  2.2 us, 34.8 sy,  0.0 ni,  4.3 id, 56.5 wa,  0.0 hi,  2.2 si,  0.0 st
It would appear that the raid5 write thread is being scheduled only on Cpu0, which is not good as core0 is the only core on this machine that processes interrupts.  Hardware interrupt load above is zero, but with a real disk and network throughput rate it will eat into the cycles needed by the RAID5 thread.
OK, I'm going to add the following to the /etc/rc.local:
for irq in `cat /proc/interrupts |grep mpt2sas| awk -F: '{ print $1}'`
do
        echo 4 > /proc/irq/${irq}/smp_affinity
done

That will move the LSI card interrupt processing to CPU2 like this:
 57:  143806142      7246     41052         0  IR-PCI-MSI-edge  mpt2sas0-msix0
 58:   14381650         0     22952         0  IR-PCI-MSI-edge  mpt2sas0-msix1
 59:    6733526         0    144387         0  IR-PCI-MSI-edge  mpt2sas0-msix2
 60:    3342802         0     32053         0  IR-PCI-MSI-edge  mpt2sas0-msix3

You can see I briefly moved one to CPU1 as well.
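
(To double check where those vectors are now allowed to run, the masks can be read back; mask 4 = CPU2, mask 8 would be CPU3:)

for irq in `cat /proc/interrupts |grep mpt2sas| awk -F: '{ print $1}'`
do
        echo -n "irq ${irq} smp_affinity: "
        cat /proc/irq/${irq}/smp_affinity
done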

Would you suggest moving the eth devices to another CPU as well, perhaps CPU3 ?

The physical IO work does not seem to be spread very well across all 4 cores.  However, the data rates are so low here it's difficult to come to any conclusion.  Cores 1-2 are performing a little work, 5-10% or so.  If you present a workload with bare minimal optimization, removing the choke hold from md and the elevator, as in my dd example up above, I'm sure you'll see much more work done by the other cores, as there will be far more IO to process.
I'll run a bunch more tests tonight, and get a better idea. For now though:
dd if=/dev/vg0/xptest of=/dev/vg0/testing iflag=direct oflag=direct bs=1536k count=5k

iostat shows much more solid read and write rates, with peaks around 120MB/s (dd itself reported 88MB/s). It also shows 0 for rrqm and wrqm, so no more merging is being done. However, the avgrq-sz value is always 128 for the destination, and almost always 128 for the source, during the copy. That seems to equal 64kB, so I'm not sure why that is when we told dd to use 1536k ...
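
(avgrq-sz is reported in 512-byte sectors, so 128 is indeed 64kB. A guess, not verified here, is that one of the layers in the stack is splitting requests at 64kB; the per-device queue limits are one place to look, device names as in the iostat output above:)

for dev in sdb md1 drbd2 dm-8 dm-17; do
    echo -n "${dev}: max_sectors_kb="
    cat /sys/block/${dev}/queue/max_sectors_kb
done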

top shows:
%Cpu0 : 0.0 us, 3.9 sy, 0.0 ni, 92.2 id, 2.0 wa, 0.0 hi, 2.0 si, 0.0 st
%Cpu1 : 0.0 us, 2.0 sy, 0.0 ni, 94.1 id, 3.9 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 2.0 us, 8.2 sy, 0.0 ni, 75.5 id, 12.2 wa, 0.0 hi, 2.0 si, 0.0 st
%Cpu3 : 0.0 us, 9.6 sy, 0.0 ni, 86.5 id, 3.8 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 19.2 us, 13.5 sy, 0.0 ni, 61.5 id, 5.8 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 21.6 us, 11.8 sy, 0.0 ni, 66.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 17.6 us, 19.6 sy, 0.0 ni, 51.0 id, 7.8 wa, 0.0 hi, 3.9 si, 0.0 st
%Cpu3 : 19.6 us, 15.7 sy, 0.0 ni, 58.8 id, 5.9 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 0.0 us, 2.0 sy, 0.0 ni, 91.8 id, 6.1 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 1.9 sy, 0.0 ni, 96.2 id, 1.9 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 2.0 us, 10.0 sy, 0.0 ni, 80.0 id, 8.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 2.0 us, 7.8 sy, 0.0 ni, 88.2 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 0.0 us, 2.0 sy, 0.0 ni, 96.1 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 0.0 sy, 0.0 ni, 93.9 id, 6.1 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 10.0 sy, 0.0 ni, 76.0 id, 14.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 9.3 sy, 0.0 ni, 85.2 id, 5.6 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 0.0 us, 3.9 sy, 0.0 ni, 92.2 id, 2.0 wa, 0.0 hi, 2.0 si, 0.0 st
%Cpu1 : 0.0 us, 2.0 sy, 0.0 ni, 94.1 id, 3.9 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 1.9 us, 15.1 sy, 0.0 ni, 67.9 id, 15.1 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 5.9 sy, 0.0 ni, 84.3 id, 9.8 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 0.0 us, 2.0 sy, 0.0 ni, 98.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 10.2 sy, 0.0 ni, 81.6 id, 8.2 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 6.1 sy, 0.0 ni, 85.7 id, 8.2 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 0.0 us, 3.8 sy, 0.0 ni, 90.4 id, 5.8 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 0.0 sy, 0.0 ni, 88.2 id, 11.8 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 8.0 sy, 0.0 ni, 90.0 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 2.0 sy, 0.0 ni, 98.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 0.0 sy, 0.0 ni, 96.0 id, 4.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 2.0 us, 6.0 sy, 0.0 ni, 86.0 id, 6.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 9.3 sy, 0.0 ni, 75.9 id, 14.8 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 0.0 us, 3.9 sy, 0.0 ni, 80.4 id, 15.7 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 3.8 sy, 0.0 ni, 90.4 id, 5.8 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 2.0 us, 3.9 sy, 0.0 ni, 92.2 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 2.0 sy, 0.0 ni, 94.1 id, 3.9 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 0.0 us, 3.9 sy, 0.0 ni, 96.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 1.9 sy, 0.0 ni, 94.2 id, 3.8 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 10.4 sy, 0.0 ni, 79.2 id, 10.4 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 6.1 sy, 0.0 ni, 91.8 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 0.0 us, 4.1 sy, 0.0 ni, 95.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 2.0 us, 2.0 sy, 0.0 ni, 94.1 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 12.0 sy, 0.0 ni, 76.0 id, 12.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 13.2 sy, 0.0 ni, 81.1 id, 5.7 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 2.0 us, 4.0 sy, 0.0 ni, 88.0 id, 6.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 0.0 sy, 0.0 ni, 96.0 id, 4.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 6.2 sy, 0.0 ni, 83.3 id, 10.4 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 7.7 sy, 0.0 ni, 84.6 id, 7.7 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 0.0 us, 4.0 sy, 0.0 ni, 96.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 3.8 sy, 0.0 ni, 88.5 id, 7.7 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 6.4 sy, 0.0 ni, 87.2 id, 6.4 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 8.0 sy, 0.0 ni, 84.0 id, 8.0 wa, 0.0 hi, 0.0 si, 0.0 st

So it looks like CPU0 is less busy, with more work being done on CPU2 (the interrupts for the LSI SATA controller)

If I increase bs=6M then dd reports 130MB/s ...


Currently, there are no LVM snapshots at all, the raid array is in sync, operating normally:
md1 : active raid5 sdd1[7] sdb1[11] sdc1[10] sdf1[9] sdg1[5] sde1[8] sda1[6]
       2813087616 blocks super 1.2 level 5, 64k chunk, algorithm 2 [7/7] [UUUUUUU]

mdadm --detail /dev/md1
/dev/md1:
         Version : 1.2
   Creation Time : Wed Aug 22 00:47:03 2012
      Raid Level : raid5
      Array Size : 2813087616 (2682.77 GiB 2880.60 GB)
   Used Dev Size : 468847936 (447.13 GiB 480.10 GB)
    Raid Devices : 7
   Total Devices : 7
     Persistence : Superblock is persistent

     Update Time : Tue Mar 25 23:55:42 2014
           State : active
  Active Devices : 7
Working Devices : 7
  Failed Devices : 0
   Spare Devices : 0

          Layout : left-symmetric
      Chunk Size : 64K

            Name : san1:1  (local to host san1)
            UUID : 707957c0:b7195438:06da5bc4:485d301c
          Events : 1713337

     Number   Major   Minor   RaidDevice State
        7       8       49        0      active sync   /dev/sdd1
        6       8        1        1      active sync   /dev/sda1
        8       8       65        2      active sync   /dev/sde1
        5       8       97        3      active sync   /dev/sdg1
        9       8       81        4      active sync   /dev/sdf1
       10       8       33        5      active sync   /dev/sdc1
       11       8       17        6      active sync   /dev/sdb1


Also, the DRBD is disconnected:
  2: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
     ns:79767379 nr:0 dw:137515806 dr:388623024 al:37206 bm:6688 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:58639192
According to your iostat output above, drbd2 was indeed still engaged, and eating 59.6% and 91.6% of a core.
Nope, definitely not connected. However, it is still part of the IO path, because the LV sits on DRBD. So it isn't talking to its partner, but it still does its own "work" in between LVM and MD.

So, I know dd isn't the ideal performance testing tool or metric, but I'd really like to know why I can't get more than 40MB/s. There is no networking, no iscsi, just a fairly simple raid5, drbd, and lvm.
You can get much more than 40MB/s, but you must know your tools, and gain a better understanding of the Linux IO subsystem.

Apologies, it was a second late night in a row, and I wasn't doing very well, I should have remembered my previous lessons about this!

So, am I crazy? What totally retarded thing have I done here?
No, not crazy.  Not totally retarded.  You simply shoved a gazillion 512 byte IOs through the block layer.  Even with SSDs that's going to be slow due to the extra work the kernel threads must perform on all those tiny IOs, and all the memory bandwidth consumed by buffered IO and stripe cache operations.

The problem with your dd run here is the same problem you had before I taught you how to use FIO a year ago.  If you recall, you were testing back then with a single dd process.  As I explained then, dd is a serial application.  It submits blocks one at a time with no overlap, and thus can't keep the request pipeline full.  With FIO and an appropriate job file, we kept the request pipeline full using parallel requests, and we used large IOs to keep overhead to a minimum.  The only way to increase dd throughput is to use large blocks and O_DIRECT to eliminate the RAM bandwidth of two unneeded memcpy's.
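
(For illustration only, one way to get some overlap out of dd itself is to split the copy into disjoint ranges and run them concurrently; a sketch, untested here, using the LV names from earlier in the thread:)

SRC=/dev/vg0/xptserver1_d1
DST=/dev/vg0/backup_xptserver1_d1_20140325_224311
BS=$((1536*1024))                                 # four full 384 KiB stripes
BLOCKS=$(( ( $(blockdev --getsize64 $SRC) + BS - 1 ) / BS ))
PER=$(( (BLOCKS + 3) / 4 ))                       # blocks per worker, 4 workers
for i in 0 1 2 3; do
    dd if=$SRC of=$DST bs=$BS count=$PER skip=$((i*PER)) seek=$((i*PER)) \
       iflag=direct oflag=direct &
done
wait

Whether that actually beats a single large-block dd depends on how well the stripe cache copes with four interleaved streams; FIO with a decent iodepth is still the cleaner way to measure it.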

You've simply forgotten that lesson, apparently.  Which is a shame, as I spent so much time teaching you the how and why of Linux IO performance...

OK, so thinking this through... We should expect really poor performance if we are not using O_DIRECT and not doing large requests in parallel. I think the parallel part of the workload should be fine in real world use, since each user and machine will be generating some random load, which should be delivered in parallel to the stack (LVM/DRBD/MD). However, in 'real world' use we don't determine the request size; only the application, the client OS, or perhaps iSCSI will determine that.

My concern is that while I can get fantastical numbers from specific tests (such as highly parallel, large block size requests) I don't need that type of I/O, so my system isn't tuned to my needs.

After working with Linbit (DRBD) I've found out some more useful information, which puts me right back at the beginning I think, but with a lot more experience and knowledge. It seems that DRBD keeps its own "journal", so every write is written to the journal, then its bitmap is marked, then the journal is written to the data area, then the bitmap is updated again, and then it starts over for the next write. This means it is doing lots and lots of small writes to the same areas of the disk, i.e. 4k blocks.
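
(For what it's worth, the size of that activity log is tunable; a minimal, illustrative drbd.conf fragment, assuming DRBD 8.4-style syntax, with the value picked purely as an example:)

resource r2 {
    disk {
        # each AL extent covers 4MB of the backing device; more extents means
        # fewer metadata round trips for a large random-write working set, at
        # the cost of a longer resync after a primary crash
        al-extents 3389;
    }
}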

Anyway, I was advised to re-organise the stack from:
RAID5 -> DRBD -> LVM -> iSCSI
To:
RAID5 -> LVM -> DRBD -> iSCSI
This means each DRBD device is smaller, so the "working set" is smaller and should be more efficient. It also means I can now easily run tests that exclude DRBD completely by targeting the LV itself, which leaves just the RAID5 + LVM layers to worry about.

When I use this fio job:
[global]
filename=/dev/vg0/testing
zero_buffers
numjobs=16
thread
group_reporting
blocksize=4k
ioengine=libaio
iodepth=16
direct=1
runtime=60
size=16g

[read]
rw=randread
stonewall

[write]
rw=randwrite
stonewall

Then I get these results:
read: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
...
read: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
write: (g=1): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
...
write: (g=1): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
2.0.8
Starting 32 threads

read: (groupid=0, jobs=16): err= 0: pid=36459
  read : io=74697MB, bw=1244.1MB/s, iops=318691 , runt= 60003msec
    slat (usec): min=0 , max=999873 , avg= 5.90, stdev=529.35
    clat (usec): min=0 , max=1002.4K, avg=795.43, stdev=5201.15
     lat (usec): min=0 , max=1002.4K, avg=801.56, stdev=5233.38
    clat percentiles (usec):
     |  1.00th=[    0],  5.00th=[  213], 10.00th=[  286], 20.00th=[ 366],
     | 30.00th=[  438], 40.00th=[  516], 50.00th=[  604], 60.00th=[ 708],
     | 70.00th=[  860], 80.00th=[ 1096], 90.00th=[ 1544], 95.00th=[ 1928],
     | 99.00th=[ 2608], 99.50th=[ 2800], 99.90th=[ 3536], 99.95th=[ 4128],
     | 99.99th=[15424]
bw (KB/s) : min=22158, max=245376, per=6.39%, avg=81462.59, stdev=22339.85
    lat (usec) : 2=3.34%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
    lat (usec) : 100=0.01%, 250=3.67%, 500=31.43%, 750=24.55%, 1000=13.33%
    lat (msec) : 2=19.37%, 4=4.25%, 10=0.04%, 20=0.01%, 50=0.01%
    lat (msec) : 100=0.01%, 250=0.01%, 1000=0.01%, 2000=0.01%
  cpu          : usr=30.27%, sys=236.67%, ctx=239859018, majf=0, minf=64588
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=19122474/w=0/d=0, short=r=0/w=0/d=0
write: (groupid=1, jobs=16): err= 0: pid=38376
  write: io=13885MB, bw=236914KB/s, iops=59228 , runt= 60016msec
    slat (usec): min=2 , max=25554K, avg=25.74, stdev=17219.99
    clat (usec): min=122 , max=43459K, avg=4294.06, stdev=100111.47
     lat (usec): min=129 , max=43459K, avg=4319.92, stdev=101581.66
    clat percentiles (usec):
     |  1.00th=[  482],  5.00th=[  628], 10.00th=[  748], 20.00th=[ 996],
     | 30.00th=[ 1320], 40.00th=[ 1784], 50.00th=[ 2352], 60.00th=[ 3056],
     | 70.00th=[ 4192], 80.00th=[ 5920], 90.00th=[ 8384], 95.00th=[10816],
     | 99.00th=[17536], 99.50th=[20096], 99.90th=[57088], 99.95th=[67072],
     | 99.99th=[123392]
bw (KB/s) : min= 98, max=25256, per=6.74%, avg=15959.71, stdev=2969.06
    lat (usec) : 250=0.01%, 500=1.25%, 750=8.72%, 1000=10.13%
    lat (msec) : 2=23.87%, 4=24.78%, 10=24.87%, 20=5.85%, 50=0.39%
    lat (msec) : 100=0.11%, 250=0.01%, 750=0.01%, 2000=0.01%, >=2000=0.01%
  cpu          : usr=5.47%, sys=39.74%, ctx=54762279, majf=0, minf=62375
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=3554662/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
READ: io=74697MB, aggrb=1244.1MB/s, minb=1244.1MB/s, maxb=1244.1MB/s, mint=60003msec, maxt=60003msec

Run status group 1 (all jobs):
WRITE: io=13885MB, aggrb=236914KB/s, minb=236914KB/s, maxb=236914KB/s, mint=60016msec, maxt=60016msec

So, a maximum of 237MB/s write. Once DRBD takes that and adds its overhead, I'm getting approx 10% of that performance (some of the time; other times I'm getting even less, but that is probably yet another issue).
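
(Back-of-envelope only, and ignoring whatever merging the stripe cache manages: with the classic RAID5 small-write penalty, each random 4k write costs roughly two reads plus two writes at the member-disk level, so

    59,228 client write IOPS x 4 member IOs = ~237k IOPS at the drives
    ~237k IOPS / 7 drives                   = ~34k random write IOPS per SSD

If that is close to the steady-state random-write ceiling of these particular SSDs, then the drives themselves may be a bigger factor than chunk size or LVM overhead.)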

Now, 237MB/s is pretty poor, and when you try to share that between a dozen VMs, with some of those VMs trying to work on 2+ GB files (Outlook users), I suspect that is why there are so many issues. The question is, what can I do to improve this? Should I use RAID5 with a smaller stripe size? Should I use RAID10 or RAID1+linear? Could the issue be coming from LVM? LVM is using 4MB physical extents, but from reading around, nobody seems to worry about the PE size in relation to performance (only LVM1 had a limit on the number of PEs, which meant a larger LV required larger PEs).
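
(One cheap thing worth ruling out on the LVM side, purely as a suggestion: check that the PV data area on top of md1 starts on a full-stripe boundary, since a misaligned pe_start turns what could be full-stripe writes into read-modify-write cycles:)

pvs -o +pe_start --units k /dev/md1
# for full-stripe writes to line up, pe_start should be a multiple of the
# 384k stripe width (6 data chunks x 64k)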

Here is the current md array:
/dev/md1:
        Version : 1.2
  Creation Time : Wed Aug 22 00:47:03 2012
     Raid Level : raid5
     Array Size : 2813087616 (2682.77 GiB 2880.60 GB)
  Used Dev Size : 468847936 (447.13 GiB 480.10 GB)
   Raid Devices : 7
  Total Devices : 7
    Persistence : Superblock is persistent

    Update Time : Sun Apr  6 05:19:14 2014
          State : clean
 Active Devices : 7
Working Devices : 7
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : san1:1  (local to host san1)
           UUID : 707957c0:b7195438:06da5bc4:485d301c
         Events : 1713347

    Number   Major   Minor   RaidDevice State
       7       8       49        0      active sync   /dev/sdd1
       6       8        1        1      active sync   /dev/sda1
       8       8       65        2      active sync   /dev/sde1
       5       8       97        3      active sync   /dev/sdg1
       9       8       81        4      active sync   /dev/sdf1
      10       8       33        5      active sync   /dev/sdc1
      11       8       17        6      active sync   /dev/sdb1

BTW, I've also split the domain controller to a win2008R2 server, and upgraded the file server to win2012R2.

Regards,
Adam

--
Adam Goryachev
Website Managers
www.websitemanagers.com.au
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



