RAID5 Performance

Hi all,

I know this is an age-old question, but I have the chance to change things up a bit, and I wanted to collect some thoughts/ideas.

Currently I am using 8 x 480GB Intel SSDs in a RAID5, then LVM on top, DRBD on top of that, and finally iSCSI on top (then used as raw disks for mostly Windows VMs).

My current array looks like this:

/dev/md1:
        Version : 1.2
  Creation Time : Wed Aug 22 00:47:03 2012
     Raid Level : raid5
     Array Size : 3281935552 (3129.90 GiB 3360.70 GB)
  Used Dev Size : 468847936 (447.13 GiB 480.10 GB)
   Raid Devices : 8
  Total Devices : 8
    Persistence : Superblock is persistent

    Update Time : Wed Jul 27 11:32:00 2016
          State : active
 Active Devices : 8
Working Devices : 8
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : san1:1  (local to host san1)
           UUID : 707957c0:b7195438:06da5bc4:485d301c
         Events : 2185221

    Number   Major   Minor   RaidDevice State
       7       8       65        0      active sync   /dev/sde1
      13       8        1        1      active sync   /dev/sda1
       8       8       81        2      active sync   /dev/sdf1
       5       8      113        3      active sync   /dev/sdh1
       9       8       97        4      active sync   /dev/sdg1
      12       8       17        5      active sync   /dev/sdb1
      10       8       49        6      active sync   /dev/sdd1
      11       8       33        7      active sync   /dev/sdc1

I've configured the following non-standard options:

echo 4096 > /sys/block/md1/md/stripe_cache_size

The following apply to all installed SSDs:
echo noop > ${disk}/queue/scheduler
echo 128 > ${disk}/queue/nr_requests
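(${disk} here loops over the md member disks; roughly like this, assuming the members are sda..sdh as in the mdadm output above:)

for disk in /sys/block/sd[a-h]; do
    echo noop > ${disk}/queue/scheduler    # no-op elevator for SSDs
    echo 128 > ${disk}/queue/nr_requests
done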

What I can measure (at peak periods) with iostat:
Device:   rrqm/s  wrqm/s     r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await r_await w_await   svctm   %util
sdi         0.00    0.00    0.00    0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00    0.00    0.00
sda        78.00   59.00   79.00   86.00    0.74    0.52    15.55     0.02    0.15    0.20    0.09    0.15    2.40
sdg        35.00   48.00   68.00   79.00    0.52    0.44    13.39     0.02    0.14    0.24    0.05    0.11    1.60
sdf        46.00   65.00   86.00   98.00    0.76    0.58    14.96     0.03    0.17    0.09    0.24    0.09    1.60
sdh        97.00   45.00   70.00  141.00    0.66    0.68    12.96     0.08    0.36    0.29    0.40    0.34    7.20
sde       101.00   75.00   87.00   94.00    0.79    0.61    15.76     0.08    0.42    0.32    0.51    0.29    5.20
sdb        85.00   54.00   94.00  102.00    0.84    0.56    14.62     0.01    0.04    0.09    0.00    0.04    0.80
sdc        85.00   74.00   98.00  106.00    0.79    0.66    14.53     0.01    0.06    0.04    0.08    0.04    0.80
sdd       230.00  199.00  266.00  353.00    2.19    2.11    14.24     0.18    0.28    0.23    0.32    0.16    9.60
drbd0       0.00    0.00    0.00    2.00    0.00    0.00     4.50     0.08   38.00    0.00   38.00   20.00    4.00
drbd12      0.00    0.00    1.00    1.00    0.00    0.00     7.50     0.03   14.00    4.00   24.00   14.00    2.80
drbd1       0.00    0.00    0.00    2.00    0.00    0.03    32.00     0.09   44.00    0.00   44.00   22.00    4.40
drbd9       0.00    0.00    2.00    0.00    0.01    0.00     8.00     0.00    0.00    0.00    0.00    0.00    0.00
drbd2       0.00    0.00    0.00    0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00    0.00    0.00
drbd11      0.00    0.00    0.00    0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00    0.00    0.00
drbd3       0.00    0.00    4.00  197.00    0.02    1.01    10.47     7.92   41.03    0.00   41.87    4.98  100.00
drbd4       0.00    0.00    0.00    0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00    0.00    0.00
drbd17      0.00    0.00    1.00    0.00    0.00    0.00     8.00     0.00    0.00    0.00    0.00    0.00    0.00
drbd5       0.00    0.00    0.00    7.00    0.00    0.03     8.00     0.22   30.29    0.00   30.29   28.57   20.00
drbd19      0.00    0.00    0.00    0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00    0.00    0.00
drbd6       0.00    0.00    2.00    0.00    0.01    0.00     8.00     0.00    0.00    0.00    0.00    0.00    0.00
drbd7       0.00    0.00    0.00    0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00    0.00    0.00
drbd8       0.00    0.00    0.00    0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00    0.00    0.00
drbd13      0.00    0.00   90.00   44.00    1.74    0.38    32.35     1.72   13.46    0.40   40.18    4.27   57.20
drbd15      0.00    0.00    2.00   33.00    0.02    0.29    17.86     1.40   40.91    0.00   43.39   28.34   99.20
drbd18      0.00    0.00    1.00    3.00    0.00    0.03    16.00     0.08   21.00    0.00   28.00   21.00    8.40
drbd14      0.00    0.00    0.00    0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00    0.00    0.00
drbd10      0.00    0.00    0.00    0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00    0.00    0.00

As you can see, the DRBD devices are busy and are slowing down the VMs. Looking at the drives on the second server, we can see why:

Device:   rrqm/s  wrqm/s     r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await r_await w_await   svctm   %util
sdf        67.00   76.00   64.00  113.00    0.52    0.62    13.17     0.26    1.47    0.06    2.27    1.45   25.60
sdg        39.00   61.00   50.00  114.00    0.35    0.56    11.38     0.45    2.76    0.08    3.93    2.71   44.40
sdd        49.00   67.00   50.00  109.00    0.39    0.57    12.40     0.75    4.73    0.00    6.90    4.70   74.80
sdh        55.00   54.00   52.00  104.00    0.42    0.51    12.12     0.81    5.21    0.23    7.69    5.13   80.00
sde        67.00   67.00   75.00  129.00    0.56    0.65    12.13     0.94    4.59    0.69    6.85    4.24   86.40
sda        64.00   76.00   58.00  109.00    0.48    0.61    13.29     0.84    5.03    0.21    7.60    4.89   81.60
sdb        35.00   72.00   57.00  104.00    0.36    0.57    11.84     0.69    4.27    0.14    6.54    4.22   68.00
sdc       118.00  144.00  228.00  269.00    1.39    1.50    11.92     1.21    2.43    1.88    2.90    1.50   74.40
md1         0.00    0.00    0.00  260.00    0.00    1.70    13.38     0.00    0.00    0.00    0.00    0.00    0.00

I've confirmed that the problem is that we have mixed two models of SSD (520 series and 530 series), and that the 530 series drives perform significantly worse under load. Above, the two 520 series drives are sdf and sdg, while the other drives are 530 series. So we will be replacing all of the drives across both systems with 545s series 1000GB SSDs (which I've confirmed perform the same as or better than the 520 series; sdc on the first machine above is already one of these).

Over the years I've learned a lot about RAID and optimisation. Originally I configured things to optimise for very fast streaming reads and writes, but in practice the actual workload is small random reads/writes, with the writes causing the biggest load.

Looking at this:
http://serverfault.com/questions/384273/optimizing-raid-5-for-backuppc-use-small-random-reads

 *

    Enhance the queue depth. Standard kernel queue depth is OK for old
    single drives with small caches, but not for modern drives or RAID
    arrays:

    echo 512 > /sys/block/sda/queue/nr_requests

So my question is: should I increase the configured nr_requests above the current 128?

If the chunk size is 64k and there are 8 drives in total, then the stripe size is currently 64k * 7 = 448k. Is this too big? My reading of the mdadm man page suggests the minimum chunk size is 4k ("In any case it must be a multiple of 4KB"). If I set the chunk size to 4k, then the stripe size becomes 28k, which means for a random 4k write we only need to write 28k instead of 448k? The drives report a sector size of 512k, which I guess means the smallest meaningful write the drive can do is 512k, so should I increase the chunk size to 512k to match? Or does that make it even worse? Finally, the drive reports Host_Writes_32MiB in SMART; does that mean the drive needs to replace an entire 32MB block in order to overwrite a sector? I'm guessing a chunk size of 32M is just crazy though...
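If a smaller chunk does turn out to be the answer, my understanding is that it can be changed in place with a reshape, roughly like the sketch below (exact requirements depend on the mdadm/kernel versions, the backup file path is just an example, and reshaping ~3TB will take a while):

# stripe width = chunk * (drives - 1): 64k * 7 = 448k today; 16k * 7 = 112k
mdadm --grow /dev/md1 --chunk=16 --backup-file=/root/md1-chunk-reshape.bak
cat /proc/mdstat    # watch the reshape progress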

Is there a better way to actually measure the different sizes and quantities of reads/writes being issued, so that I can make a more accurate decision on chunk size/stripe size/etc.? iostat seems to show averages, but not the number of 1k reads/writes, 4k reads/writes, 16k reads/writes, etc.
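The best I've come up with so far is blktrace + blkparse with a small awk histogram over the request sizes. A rough sketch (assumes blktrace is installed and debugfs is mounted at /sys/kernel/debug; the "+ N" in blkparse output is the request size in 512-byte sectors):

blktrace -w 30 -d /dev/md1 -o md1trace    # trace 30 seconds of I/O on md1
blkparse -i md1trace | awk '
    $6 == "Q" {                           # queued bios (md is bio-based)
        for (i = 7; i < NF; i++)
            if ($i == "+") { kb = $(i+1) / 2; hist[kb]++ }
    }
    END { for (kb in hist) printf "%8.1f KB: %d\n", kb, hist[kb] }' | sort -n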

My suspicion is that the actual load is made up of rather small random reads/writes, because that is the scenario that produced the worst performance results when I was initially setting this up, and it seems to be what we are getting in practice.

The last option: what if I moved to RAID10? Would that provide a significant performance boost? It would completely remove the need to worry about chunk/stripe size, because we always write exactly the data we want, with no read/compute/write cycle. Or is that read/compute overhead negligible since I'm using SSDs and read performance is so quick?
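If it helps, I could benchmark both layouts with fio on a scratch LV before committing. Roughly like this (a sketch only; /dev/vg0/scratch is a placeholder for a test volume holding no real data, since fio will overwrite it):

fio --name=randwrite-test --filename=/dev/vg0/scratch \
    --rw=randwrite --bs=4k --direct=1 \
    --ioengine=libaio --iodepth=32 --numjobs=4 \
    --runtime=60 --time_based --group_reporting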

For completeness, PV information:
  PV Name               /dev/md1
  VG Name               vg0
  PV Size               3.06 TiB / not usable 2.94 MiB
  Allocatable           yes
  PE Size               4.00 MiB
  Total PE              801253
  Free PE               33281
  Allocated PE          767972
  PV UUID c0PIEb-tUka-zBk3-lcGM-H89s-ayde-hcMUBZ

Any advice or assistance would be greatly appreciated.

Regards,
Adam
--
Adam Goryachev
Website Managers
www.websitemanagers.com.au