Hi all,
I know this is an age-old question, but I have the chance to change things
up a bit, and I wanted to collect some thoughts/ideas.
Currently I am using 8 x 480GB Intel SSDs in a RAID5, with LVM on top,
DRBD on top of that, and finally iSCSI on top (which is then used to
provide raw disks for mostly Windows VMs).
My current array looks like this:
/dev/md1:
        Version : 1.2
  Creation Time : Wed Aug 22 00:47:03 2012
     Raid Level : raid5
     Array Size : 3281935552 (3129.90 GiB 3360.70 GB)
  Used Dev Size : 468847936 (447.13 GiB 480.10 GB)
   Raid Devices : 8
  Total Devices : 8
    Persistence : Superblock is persistent

    Update Time : Wed Jul 27 11:32:00 2016
          State : active
 Active Devices : 8
Working Devices : 8
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : san1:1 (local to host san1)
           UUID : 707957c0:b7195438:06da5bc4:485d301c
         Events : 2185221

    Number   Major   Minor   RaidDevice State
       7       8       65        0      active sync   /dev/sde1
      13       8        1        1      active sync   /dev/sda1
       8       8       81        2      active sync   /dev/sdf1
       5       8      113        3      active sync   /dev/sdh1
       9       8       97        4      active sync   /dev/sdg1
      12       8       17        5      active sync   /dev/sdb1
      10       8       49        6      active sync   /dev/sdd1
      11       8       33        7      active sync   /dev/sdc1
I've configured the following non-standard options:
echo 4096 > /sys/block/md1/md/stripe_cache_size
The following apply to all of the SSDs installed (set per disk; see the
loop below):
echo noop > ${disk}/queue/scheduler
echo 128 > ${disk}/queue/nr_requests
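For context, a loop along these lines would apply them to every member
disk (a sketch only; the sd{a..h} range is an assumption based on the
current device names):

for dev in sd{a..h}; do
    disk=/sys/block/${dev}
    # let the SSD do its own ordering rather than the kernel elevator
    echo noop > ${disk}/queue/scheduler
    # per-device request queue depth
    echo 128 > ${disk}/queue/nr_requests
done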
What I can measure (at peak periods) with iostat:
Device:      rrqm/s   wrqm/s      r/s      w/s    rMB/s    wMB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
sdi            0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
sda           78.00    59.00    79.00    86.00     0.74     0.52    15.55     0.02     0.15     0.20     0.09     0.15     2.40
sdg           35.00    48.00    68.00    79.00     0.52     0.44    13.39     0.02     0.14     0.24     0.05     0.11     1.60
sdf           46.00    65.00    86.00    98.00     0.76     0.58    14.96     0.03     0.17     0.09     0.24     0.09     1.60
sdh           97.00    45.00    70.00   141.00     0.66     0.68    12.96     0.08     0.36     0.29     0.40     0.34     7.20
sde          101.00    75.00    87.00    94.00     0.79     0.61    15.76     0.08     0.42     0.32     0.51     0.29     5.20
sdb           85.00    54.00    94.00   102.00     0.84     0.56    14.62     0.01     0.04     0.09     0.00     0.04     0.80
sdc           85.00    74.00    98.00   106.00     0.79     0.66    14.53     0.01     0.06     0.04     0.08     0.04     0.80
sdd          230.00   199.00   266.00   353.00     2.19     2.11    14.24     0.18     0.28     0.23     0.32     0.16     9.60
drbd0          0.00     0.00     0.00     2.00     0.00     0.00     4.50     0.08    38.00     0.00    38.00    20.00     4.00
drbd12         0.00     0.00     1.00     1.00     0.00     0.00     7.50     0.03    14.00     4.00    24.00    14.00     2.80
drbd1          0.00     0.00     0.00     2.00     0.00     0.03    32.00     0.09    44.00     0.00    44.00    22.00     4.40
drbd9          0.00     0.00     2.00     0.00     0.01     0.00     8.00     0.00     0.00     0.00     0.00     0.00     0.00
drbd2          0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
drbd11         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
drbd3          0.00     0.00     4.00   197.00     0.02     1.01    10.47     7.92    41.03     0.00    41.87     4.98   100.00
drbd4          0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
drbd17         0.00     0.00     1.00     0.00     0.00     0.00     8.00     0.00     0.00     0.00     0.00     0.00     0.00
drbd5          0.00     0.00     0.00     7.00     0.00     0.03     8.00     0.22    30.29     0.00    30.29    28.57    20.00
drbd19         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
drbd6          0.00     0.00     2.00     0.00     0.01     0.00     8.00     0.00     0.00     0.00     0.00     0.00     0.00
drbd7          0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
drbd8          0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
drbd13         0.00     0.00    90.00    44.00     1.74     0.38    32.35     1.72    13.46     0.40    40.18     4.27    57.20
drbd15         0.00     0.00     2.00    33.00     0.02     0.29    17.86     1.40    40.91     0.00    43.39    28.34    99.20
drbd18         0.00     0.00     1.00     3.00     0.00     0.03    16.00     0.08    21.00     0.00    28.00    21.00     8.40
drbd14         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
drbd10         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
As you can see, the DRBD devices are busy and are slowing down the VMs.
Looking at the drives on the second server, we can see why:
Device:      rrqm/s   wrqm/s      r/s      w/s    rMB/s    wMB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
sdf           67.00    76.00    64.00   113.00     0.52     0.62    13.17     0.26     1.47     0.06     2.27     1.45    25.60
sdg           39.00    61.00    50.00   114.00     0.35     0.56    11.38     0.45     2.76     0.08     3.93     2.71    44.40
sdd           49.00    67.00    50.00   109.00     0.39     0.57    12.40     0.75     4.73     0.00     6.90     4.70    74.80
sdh           55.00    54.00    52.00   104.00     0.42     0.51    12.12     0.81     5.21     0.23     7.69     5.13    80.00
sde           67.00    67.00    75.00   129.00     0.56     0.65    12.13     0.94     4.59     0.69     6.85     4.24    86.40
sda           64.00    76.00    58.00   109.00     0.48     0.61    13.29     0.84     5.03     0.21     7.60     4.89    81.60
sdb           35.00    72.00    57.00   104.00     0.36     0.57    11.84     0.69     4.27     0.14     6.54     4.22    68.00
sdc          118.00   144.00   228.00   269.00     1.39     1.50    11.92     1.21     2.43     1.88     2.90     1.50    74.40
md1            0.00     0.00     0.00   260.00     0.00     1.70    13.38     0.00     0.00     0.00     0.00     0.00     0.00
I've confirmed that the problem is that we have mixed two models of SSD
(520 series and 530 series), and that the 530 series drives perform
significantly worse under load. In the table above, the two 520 series
drives are sdf and sdg, while the other drives are 530 series. So we
will be replacing all of the drives across both systems with 545s series
1000GB SSDs (which I've confirmed perform the same as or better than the
520 series; sdc on the first machine above is already one of these).
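For anyone wanting to double-check which model sits behind each device
name, something like the following works (assuming smartmontools is
installed and sd{a..h} are the array members):

for dev in sd{a..h}; do
    # print the device name and its reported model string
    echo -n "/dev/${dev}: "
    smartctl -i /dev/${dev} | grep -i "Device Model"
done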
Over the years I've learned a lot about RAID and optimisation. Originally
I configured things to optimise for very fast streaming reads and
streaming writes, but in practice the actual workload is small random
reads/writes, with the writes causing the biggest load.
Looking at this:
http://serverfault.com/questions/384273/optimizing-raid-5-for-backuppc-use-small-random-reads
* Enhance the queue depth. Standard kernel queue depth is OK for old
  single drives with small caches, but not for modern drives or RAID
  arrays:
echo 512 > /sys/block/sda/queue/nr_requests
So my question is: should I increase the configured nr_requests above the
current 128?
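To make the question concrete, the change I'd be trialling is simply this
(a sketch only; 512 comes from the serverfault answer, and the sd{a..h}
range is an assumption):

for dev in sd{a..h}; do
    # deeper per-device request queue on each md member
    echo 512 > /sys/block/${dev}/queue/nr_requests
done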
If the chunk size is 64k and there are 8 drives in total, then the stripe
size is currently 64k*7 = 448k. Is this too big? My reading of the mdadm
man page suggests the minimum chunk size is 4k ("In any case it must be a
multiple of 4KB"). If I set the chunk size to 4k, then the stripe size
becomes 28k, which would mean a random 4k write only needs to touch 28k
instead of 448k?
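For the record, I assume a chunk size change would be done with something
along these lines (illustrative only, not something I've run; my
understanding is that this kind of reshape needs a backup file kept off
the array and can take a long while):

# illustrative only: reshape to a 4k chunk, backup file kept off the array
mdadm --grow /dev/md1 --chunk=4 --backup-file=/root/md1-chunk-reshape.bak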
The drives report a sector size of 512k, which I guess means the
smallest meaningful write that the drive can do is 512k, so should I
increase the chunk size to 512k to match? Or does that make it even worse?
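In case it's relevant, the logical/physical sector sizes can be
cross-checked with the usual queries (sda is just one example member):

# logical and physical sector size as the kernel reports them
cat /sys/block/sda/queue/logical_block_size
cat /sys/block/sda/queue/physical_block_size
# or the same via blockdev
blockdev --getss --getpbsz /dev/sda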
Finally, the drive reports Host_Writes_32MiB in SMART; does that mean
that the drive needs to replace an entire 32MB chunk in order to
overwrite a sector? I'm guessing a chunk size of 32M is just crazy though...
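For completeness, this is the kind of SMART query I'm looking at
(assuming smartmontools; the grep pattern is only illustrative):

# dump the SMART attribute table and pick out the host-writes counter
smartctl -A /dev/sda | grep -i host_writes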
Is there a better way to actually measure the different sizes and
quantities of reads/writes being issued, so that I can make a more
accurate decision on chunk size/stripe size/etc.? iostat seems to show
averages, but not the number of 1k reads/writes, 4k reads/writes, 16k
reads/writes, etc.
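The closest thing I can think of is tracing the block layer for a while
and building a histogram of request sizes, roughly like this (a sketch;
assumes blktrace/blkparse are installed, and that the awk field positions
match the default blkparse output format):

# capture 60 seconds of block-layer events for the array
blktrace -d /dev/md1 -w 60 -o md1trace
# size in bytes of each completed request, then count how often each size occurs
blkparse -i md1trace | awk '$6 == "C" && $9 == "+" { print $10 * 512 }' | sort -n | uniq -c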
My suspicion is that the actual load is made up of rather small random
reads/writes, because that is the scenario that produced the worst
performance results when I was initially setting this up, and it seems to
be what we are getting in practice.
The last option is: what if I moved to RAID10? Would that provide a
significant performance boost (it completely removes the need to worry
about chunk/stripe size, because we always just write the exact data we
want, with no read/compute/write cycle)?
Or is that read/compute overhead negligible anyway, since I'm using SSDs
and read performance is so quick?
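Purely for illustration (device names and chunk size are placeholders,
and this obviously couldn't be run against the live disks), I'd expect
the RAID10 alternative to be created along these lines:

mdadm --create /dev/md2 --level=10 --raid-devices=8 --chunk=64 /dev/sd[a-h]1

With 8 x 1000GB drives that would give roughly 4TB usable, versus roughly
7TB for the same drives in RAID5.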
For completeness, PV information:
  PV Name               /dev/md1
  VG Name               vg0
  PV Size               3.06 TiB / not usable 2.94 MiB
  Allocatable           yes
  PE Size               4.00 MiB
  Total PE              801253
  Free PE               33281
  Allocated PE          767972
  PV UUID               c0PIEb-tUka-zBk3-lcGM-H89s-ayde-hcMUBZ
Any advice or assistance would be greatly appreciated.
Regards,
Adam
--
Adam Goryachev
Website Managers
www.websitemanagers.com.au