Hi all,
I know this is an age-old question, but I have the chance to change things
up a bit, and I wanted to collect some thoughts/ideas.
Currently I am using 8 x 480GB Intel SSDs in a RAID5, with LVM on top,
DRBD on top of that, and finally iSCSI on top (which is then used to
provide raw disks for mostly Windows VMs).
My current array looks like this:
/dev/md1:
        Version : 1.2
  Creation Time : Wed Aug 22 00:47:03 2012
     Raid Level : raid5
     Array Size : 3281935552 (3129.90 GiB 3360.70 GB)
  Used Dev Size : 468847936 (447.13 GiB 480.10 GB)
   Raid Devices : 8
  Total Devices : 8
    Persistence : Superblock is persistent

    Update Time : Wed Jul 27 11:32:00 2016
          State : active
 Active Devices : 8
Working Devices : 8
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : san1:1 (local to host san1)
           UUID : 707957c0:b7195438:06da5bc4:485d301c
         Events : 2185221

    Number   Major   Minor   RaidDevice State
       7       8       65        0      active sync   /dev/sde1
      13       8        1        1      active sync   /dev/sda1
       8       8       81        2      active sync   /dev/sdf1
       5       8      113        3      active sync   /dev/sdh1
       9       8       97        4      active sync   /dev/sdg1
      12       8       17        5      active sync   /dev/sdb1
      10       8       49        6      active sync   /dev/sdd1
      11       8       33        7      active sync   /dev/sdc1
I've configured the following non-standard options:
echo 4096 > /sys/block/md1/md/stripe_cache_size
The following apply to all of the SSDs installed (set per disk; see the
loop below):
echo noop > ${disk}/queue/scheduler
echo 128 > ${disk}/queue/nr_requests
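For context, a loop along these lines would apply them to every member
disk (a sketch only; the sd{a..h} range is an assumption based on the
current device names):

for dev in sd{a..h}; do
    disk=/sys/block/${dev}
    # let the SSD do its own ordering rather than the kernel elevator
    echo noop > ${disk}/queue/scheduler
    # per-device request queue depth
    echo 128 > ${disk}/queue/nr_requests
done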
What I can measure (at peak periods) with iostat:
Device:      rrqm/s   wrqm/s      r/s      w/s    rMB/s    wMB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
sdi            0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
sda           78.00    59.00    79.00    86.00     0.74     0.52    15.55     0.02     0.15     0.20     0.09     0.15     2.40
sdg           35.00    48.00    68.00    79.00     0.52     0.44    13.39     0.02     0.14     0.24     0.05     0.11     1.60
sdf           46.00    65.00    86.00    98.00     0.76     0.58    14.96     0.03     0.17     0.09     0.24     0.09     1.60
sdh           97.00    45.00    70.00   141.00     0.66     0.68    12.96     0.08     0.36     0.29     0.40     0.34     7.20
sde          101.00    75.00    87.00    94.00     0.79     0.61    15.76     0.08     0.42     0.32     0.51     0.29     5.20
sdb           85.00    54.00    94.00   102.00     0.84     0.56    14.62     0.01     0.04     0.09     0.00     0.04     0.80
sdc           85.00    74.00    98.00   106.00     0.79     0.66    14.53     0.01     0.06     0.04     0.08     0.04     0.80
sdd          230.00   199.00   266.00   353.00     2.19     2.11    14.24     0.18     0.28     0.23     0.32     0.16     9.60
drbd0          0.00     0.00     0.00     2.00     0.00     0.00     4.50     0.08    38.00     0.00    38.00    20.00     4.00
drbd12         0.00     0.00     1.00     1.00     0.00     0.00     7.50     0.03    14.00     4.00    24.00    14.00     2.80
drbd1          0.00     0.00     0.00     2.00     0.00     0.03    32.00     0.09    44.00     0.00    44.00    22.00     4.40
drbd9          0.00     0.00     2.00     0.00     0.01     0.00     8.00     0.00     0.00     0.00     0.00     0.00     0.00
drbd2          0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
drbd11         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
drbd3          0.00     0.00     4.00   197.00     0.02     1.01    10.47     7.92    41.03     0.00    41.87     4.98   100.00
drbd4          0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
drbd17         0.00     0.00     1.00     0.00     0.00     0.00     8.00     0.00     0.00     0.00     0.00     0.00     0.00
drbd5          0.00     0.00     0.00     7.00     0.00     0.03     8.00     0.22    30.29     0.00    30.29    28.57    20.00
drbd19         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
drbd6          0.00     0.00     2.00     0.00     0.01     0.00     8.00     0.00     0.00     0.00     0.00     0.00     0.00
drbd7          0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
drbd8          0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
drbd13         0.00     0.00    90.00    44.00     1.74     0.38    32.35     1.72    13.46     0.40    40.18     4.27    57.20
drbd15         0.00     0.00     2.00    33.00     0.02     0.29    17.86     1.40    40.91     0.00    43.39    28.34    99.20
drbd18         0.00     0.00     1.00     3.00     0.00     0.03    16.00     0.08    21.00     0.00    28.00    21.00     8.40
drbd14         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
drbd10         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
As you can see, the DRBD devices are busy and are slowing down the VMs.
Looking at the drives on the second server, we can see why:
Device:      rrqm/s   wrqm/s      r/s      w/s    rMB/s    wMB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
sdf           67.00    76.00    64.00   113.00     0.52     0.62    13.17     0.26     1.47     0.06     2.27     1.45    25.60
sdg           39.00    61.00    50.00   114.00     0.35     0.56    11.38     0.45     2.76     0.08     3.93     2.71    44.40
sdd           49.00    67.00    50.00   109.00     0.39     0.57    12.40     0.75     4.73     0.00     6.90     4.70    74.80
sdh           55.00    54.00    52.00   104.00     0.42     0.51    12.12     0.81     5.21     0.23     7.69     5.13    80.00
sde           67.00    67.00    75.00   129.00     0.56     0.65    12.13     0.94     4.59     0.69     6.85     4.24    86.40
sda           64.00    76.00    58.00   109.00     0.48     0.61    13.29     0.84     5.03     0.21     7.60     4.89    81.60
sdb           35.00    72.00    57.00   104.00     0.36     0.57    11.84     0.69     4.27     0.14     6.54     4.22    68.00
sdc          118.00   144.00   228.00   269.00     1.39     1.50    11.92     1.21     2.43     1.88     2.90     1.50    74.40
md1            0.00     0.00     0.00   260.00     0.00     1.70    13.38     0.00     0.00     0.00     0.00     0.00     0.00
I've confirmed that the problem is that we have mixed two models of SSD
(520 series and 530 series), and that the 530 series drives perform
significantly worse under load. In the table above, the two 520 series
drives are sdf and sdg, while the other drives are 530 series. So we
will be replacing all of the drives across both systems with 545s series
1000GB SSDs (which I've confirmed perform the same as or better than the
520 series; sdc on the first machine above is already one of these).
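For anyone wanting to double-check which model sits behind each device
name, something like the following works (assuming smartmontools is
installed and sd{a..h} are the array members):

for dev in sd{a..h}; do
    # print the device name and its reported model string
    echo -n "/dev/${dev}: "
    smartctl -i /dev/${dev} | grep -i "Device Model"
done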
Over the years I've learned a lot about RAID and optimisation. Originally
I configured things to optimise for very fast streaming reads and
streaming writes, but in practice the actual workload is small random
reads/writes, with the writes causing the biggest load.
Looking at this:
http://serverfault.com/questions/384273/optimizing-raid-5-for-backuppc-use-small-random-reads
* Enhance the queue depth. Standard kernel queue depth is OK for old
  single drives with small caches, but not for modern drives or RAID
  arrays:
echo 512 > /sys/block/sda/queue/nr_requests
So my question is: should I increase the configured nr_requests above the
current 128?
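To make the question concrete, the change I'd be trialling is simply this
(a sketch only; 512 comes from the serverfault answer, and the sd{a..h}
range is an assumption):

for dev in sd{a..h}; do
    # deeper per-device request queue on each md member
    echo 512 > /sys/block/${dev}/queue/nr_requests
done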
If the chunk size is 64k and there are 8 drives in total, then the stripe
size is currently 64k*7 = 448k. Is this too big? My reading of the mdadm
man page suggests the minimum chunk size is 4k ("In any case it must be a
multiple of 4KB"). If I set the chunk size to 4k, then the stripe size
becomes 28k, which would mean a random 4k write only needs to touch 28k
instead of 448k?
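For the record, I assume a chunk size change would be done with something
along these lines (illustrative only, not something I've run; my
understanding is that this kind of reshape needs a backup file kept off
the array and can take a long while):

# illustrative only: reshape to a 4k chunk, backup file kept off the array
mdadm --grow /dev/md1 --chunk=4 --backup-file=/root/md1-chunk-reshape.bak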
The drives report a sector size of 512k, which I guess means the
smallest meaningful write that the drive can do is 512k, so should I
increase the chunk size to 512k to match? Or does that make it even worse?
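In case it's relevant, the logical/physical sector sizes can be
cross-checked with the usual queries (sda is just one example member):

# logical and physical sector size as the kernel reports them
cat /sys/block/sda/queue/logical_block_size
cat /sys/block/sda/queue/physical_block_size
# or the same via blockdev
blockdev --getss --getpbsz /dev/sda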
Finally, the drive reports Host_Writes_32MiB in SMART; does that mean
that the drive needs to replace an entire 32MB chunk in order to
overwrite a sector? I'm guessing a chunk size of 32M is just crazy though...
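For completeness, this is the kind of SMART query I'm looking at
(assuming smartmontools; the grep pattern is only illustrative):

# dump the SMART attribute table and pick out the host-writes counter
smartctl -A /dev/sda | grep -i host_writes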
Is there a better way to actually measure the different sizes and
quantities of reads/writes being issued, so that I can make a more
accurate decision on chunk size/stripe size/etc.? iostat seems to show
averages, but not the number of 1k reads/writes, 4k reads/writes, 16k
reads/writes, etc.
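The closest thing I can think of is tracing the block layer for a while
and building a histogram of request sizes, roughly like this (a sketch;
assumes blktrace/blkparse are installed, and that the awk field positions
match the default blkparse output format):

# capture 60 seconds of block-layer events for the array
blktrace -d /dev/md1 -w 60 -o md1trace
# size in bytes of each completed request, then count how often each size occurs
blkparse -i md1trace | awk '$6 == "C" && $9 == "+" { print $10 * 512 }' | sort -n | uniq -c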
My suspicion is that the actual load is made up of rather small random
reads/writes, because that is the scenario that produced the worst
performance results when I was initially setting this up, and it seems to
be what we are getting in practice.
The last option is: what if I moved to RAID10? Would that provide a
significant performance boost (it completely removes the need to worry
about chunk/stripe size, because we always just write the exact data we
want, with no read/compute/write cycle)?
Or is that read/compute overhead negligible anyway, since I'm using SSDs
and read performance is so quick?
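Purely for illustration (device names and chunk size are placeholders,
and this obviously couldn't be run against the live disks), I'd expect
the RAID10 alternative to be created along these lines:

mdadm --create /dev/md2 --level=10 --raid-devices=8 --chunk=64 /dev/sd[a-h]1

With 8 x 1000GB drives that would give roughly 4TB usable, versus roughly
7TB for the same drives in RAID5.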
For completeness, PV information:
  PV Name               /dev/md1
  VG Name               vg0
  PV Size               3.06 TiB / not usable 2.94 MiB
  Allocatable           yes
  PE Size               4.00 MiB
  Total PE              801253
  Free PE               33281
  Allocated PE          767972
  PV UUID               c0PIEb-tUka-zBk3-lcGM-H89s-ayde-hcMUBZ
Any advice or assistance would be greatly appreciated.
Regards,
Adam
--
Adam Goryachev
Website Managers
www.websitemanagers.com.au