On Tue, Jul 26, 2016 at 7:24 PM, Adam Goryachev <mailinglists@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> Hi all,
>
> I know, age old question, but I have the chance to change things up a bit,
> and I wanted to collect some thoughts/ideas.
>
> Currently I am using 8 x 480GB Intel SSD in a RAID5, then LVM on top, DRBD
> on top, and finally iSCSI on top (and then used as VM raw disks for mostly
> windows VM's).
>
> My current array looks like this:
>
> /dev/md1:
>         Version : 1.2
>   Creation Time : Wed Aug 22 00:47:03 2012
>      Raid Level : raid5
>      Array Size : 3281935552 (3129.90 GiB 3360.70 GB)
>   Used Dev Size : 468847936 (447.13 GiB 480.10 GB)
>    Raid Devices : 8
>   Total Devices : 8
>     Persistence : Superblock is persistent
>
>     Update Time : Wed Jul 27 11:32:00 2016
>           State : active
>  Active Devices : 8
> Working Devices : 8
>  Failed Devices : 0
>   Spare Devices : 0
>
>          Layout : left-symmetric
>      Chunk Size : 64K
>
>            Name : san1:1 (local to host san1)
>            UUID : 707957c0:b7195438:06da5bc4:485d301c
>          Events : 2185221
>
>     Number   Major   Minor   RaidDevice State
>        7       8       65        0      active sync   /dev/sde1
>       13       8        1        1      active sync   /dev/sda1
>        8       8       81        2      active sync   /dev/sdf1
>        5       8      113        3      active sync   /dev/sdh1
>        9       8       97        4      active sync   /dev/sdg1
>       12       8       17        5      active sync   /dev/sdb1
>       10       8       49        6      active sync   /dev/sdd1
>       11       8       33        7      active sync   /dev/sdc1
>
> I've configured the following non-standard options:
>
> echo 4096 > /sys/block/md1/md/stripe_cache_size
>
> The following apply to all SSD's installed:
> echo noop > $disk/queue/scheduler
> echo 128 > ${disk}/queue/nr_requests
>
> What I can measure (at peak periods) with iostat:
> Device:   rrqm/s   wrqm/s      r/s      w/s    rMB/s    wMB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
> sdi         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
> sda        78.00    59.00    79.00    86.00     0.74     0.52    15.55     0.02     0.15     0.20     0.09     0.15     2.40
> sdg        35.00    48.00    68.00    79.00     0.52     0.44    13.39     0.02     0.14     0.24     0.05     0.11     1.60
> sdf        46.00    65.00    86.00    98.00     0.76     0.58    14.96     0.03     0.17     0.09     0.24     0.09     1.60
> sdh        97.00    45.00    70.00   141.00     0.66     0.68    12.96     0.08     0.36     0.29     0.40     0.34     7.20
> sde       101.00    75.00    87.00    94.00     0.79     0.61    15.76     0.08     0.42     0.32     0.51     0.29     5.20
> sdb        85.00    54.00    94.00   102.00     0.84     0.56    14.62     0.01     0.04     0.09     0.00     0.04     0.80
> sdc        85.00    74.00    98.00   106.00     0.79     0.66    14.53     0.01     0.06     0.04     0.08     0.04     0.80
> sdd       230.00   199.00   266.00   353.00     2.19     2.11    14.24     0.18     0.28     0.23     0.32     0.16     9.60
> drbd0       0.00     0.00     0.00     2.00     0.00     0.00     4.50     0.08    38.00     0.00    38.00    20.00     4.00
> drbd12      0.00     0.00     1.00     1.00     0.00     0.00     7.50     0.03    14.00     4.00    24.00    14.00     2.80
> drbd1       0.00     0.00     0.00     2.00     0.00     0.03    32.00     0.09    44.00     0.00    44.00    22.00     4.40
> drbd9       0.00     0.00     2.00     0.00     0.01     0.00     8.00     0.00     0.00     0.00     0.00     0.00     0.00
> drbd2       0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
> drbd11      0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
> drbd3       0.00     0.00     4.00   197.00     0.02     1.01    10.47     7.92    41.03     0.00    41.87     4.98   100.00
> drbd4       0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
> drbd17      0.00     0.00     1.00     0.00     0.00     0.00     8.00     0.00     0.00     0.00     0.00     0.00     0.00
> drbd5       0.00     0.00     0.00     7.00     0.00     0.03     8.00     0.22    30.29     0.00    30.29    28.57    20.00
> drbd19      0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
> drbd6       0.00     0.00     2.00     0.00     0.01     0.00     8.00     0.00     0.00     0.00     0.00     0.00     0.00
> drbd7       0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
> drbd8       0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
> drbd13      0.00     0.00    90.00    44.00     1.74     0.38    32.35     1.72    13.46     0.40    40.18     4.27    57.20
> drbd15      0.00     0.00     2.00    33.00     0.02     0.29    17.86     1.40    40.91     0.00    43.39    28.34    99.20
> drbd18      0.00     0.00     1.00     3.00     0.00     0.03    16.00     0.08    21.00     0.00    28.00    21.00     8.40
> drbd14      0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
> drbd10      0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
>
> As you can see, the DRBD devices are busy, and slowing down the VM's,
> looking at the drives on the second server we can see why:
> Device:   rrqm/s   wrqm/s      r/s      w/s    rMB/s    wMB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
> sdf        67.00    76.00    64.00   113.00     0.52     0.62    13.17     0.26     1.47     0.06     2.27     1.45    25.60
> sdg        39.00    61.00    50.00   114.00     0.35     0.56    11.38     0.45     2.76     0.08     3.93     2.71    44.40
> sdd        49.00    67.00    50.00   109.00     0.39     0.57    12.40     0.75     4.73     0.00     6.90     4.70    74.80
> sdh        55.00    54.00    52.00   104.00     0.42     0.51    12.12     0.81     5.21     0.23     7.69     5.13    80.00
> sde        67.00    67.00    75.00   129.00     0.56     0.65    12.13     0.94     4.59     0.69     6.85     4.24    86.40
> sda        64.00    76.00    58.00   109.00     0.48     0.61    13.29     0.84     5.03     0.21     7.60     4.89    81.60
> sdb        35.00    72.00    57.00   104.00     0.36     0.57    11.84     0.69     4.27     0.14     6.54     4.22    68.00
> sdc       118.00   144.00   228.00   269.00     1.39     1.50    11.92     1.21     2.43     1.88     2.90     1.50    74.40
> md1         0.00     0.00     0.00   260.00     0.00     1.70    13.38     0.00     0.00     0.00     0.00     0.00     0.00
>
> I've confirmed that the problem is that we have mixed two models of SSD (520
> series and 530 series), and that the 530 series drives perform significantly
> worse (under load) in comparison. Above, the two 520 series are sdf and sdg
> while the other drives are 530 series. So, we will be replacing all of the
> drives across both systems with 545s series 1000GB SSD's (which I've
> confirmed will operate same or better than the 520 series, sdc on the first
> machine above is one of these already).
>
> Over the years, I've learned a lot about RAID and optimisation, originally I
> configured things to optimise for super fast streaming reads and streaming
> writes, but in practice, the actual work-load is small random read/write,
> with the writes causing the biggest load.
>
> Looking at this:
> http://serverfault.com/questions/384273/optimizing-raid-5-for-backuppc-use-small-random-reads
>>
>>   * Enhance the queue depth.
>>     Standard kernel queue depth is OK for old single drives with small
>>     caches, but not for modern drives or RAID arrays:
>>
>>     echo 512 > /sys/block/sda/queue/nr_requests
>>
> So my question is should I increase the configured nr_requests above the
> current 128?

With your workload, it probably won't matter too much. Really high
queue depths are great on paper, but you rarely see them in practice.

> If the chunk size is 64k, and there are 8 drives in total, then the stripe
> size is currently 64k*7 = 448k, is this too big? My reading of the mdadm man
> page suggests the minimum chunk size is 4k ("In any case it must be a
> multiple of 4KB"). If I set the chunk size to 4k, then the stripe size
> becomes 28k, which means for a random 4k write, we only need to write 28k
> instead of 448k ?

This is not how a random write works. If you are running raid-5 before
the 4.4 kernel, you get the "old" read/modify/write algorithm. If you
write 4K, the system will read 4K from each of the other (n-2) data
drives, combine those with your new 4K to compute parity, and write 2
drives (data and parity). This is n-2 reads + 2 writes. With the "new"
logic in 4.4, you read the old contents of the 4K block plus the old
parity, and re-write the 4K plus parity, so there are 2 reads and 2
writes. With big arrays, the "new" logic can help quite a bit, but the
chatter rate is still high. Note that the new logic is only raid-5.
raid-6 cannot use the new logic and has to read the stripe from every
drive.

The stripe size determines when the system can avoid doing a
read/modify/write. If you write a full stripe [ 64K * (n-1) ], and the
write is exactly on a stripe boundary, and you get lucky and the
background thread does not wake up at just the wrong time, you will do
the write with zero reads.

I personally run with very small chunks, but I have code that always
writes perfect stripe writes, and stock file systems don't act that way.

DRBD can saturate GigE without any problem with random 4K writes.
I have a pair of systems here that pushes 110 MB/sec at 4K, or 28,000
IOPS. The target array needs to keep up, but that is another story. My
testing with DRBD is that it starts to peter out at 10Gig, so if you
want more bandwidth you need some other approach. Some vendors use SRP
over Infiniband with software raid-1 as a mirror. iSCSI with iSER
should give you similar results with RDMA capable ethernet. Linbit (the
people who write DRBD) have a non GPL extension to DRBD that uses RDMA,
so you can get more bandwidth that way as well.

> The drives report a sector size of 512k, which I guess means the smallest
> meaningful write that the drive can do is 512k, so should I increase the
> chunk size to 512k to match? Or does that make it even worse?
> Finally, the drive reports Host_Writes_32MiB in SMART, does that mean that
> the drive needs to replace a entire 32MB chunk in order to overwrite a
> sector? I'm guessing a chunk size of 32M is just crazy though...

This is probably not true. If the drive really had to update 512K at a
time, then 4K writes would be 128x wear amplification. SSDs can be bad,
but usually not that bad.

> Is there a better way to actually measure the different sizes and quantity
> of read/writes being issued, so that I can make a more accurate decision on
> chunk size/stripe size/etc... iostat seems to show an average numbers, but
> not the number of 1k read/write, 4k read/write, 16k read/write etc...

The problem is that the FTL of the SSDs is a black box, and as the
array gets bigger, the slowest drive dictates the array performance.
This is why the "big vendors" all map SSDs in the host and avoid or
minimize writing randomly. I know of one vendor install that has 4000
VDI seats (using ESXi as compute hosts) from a single HA pair of 24 SSD
shelves. The connection to ESXi is FC and the hosts are HA with an
IB/SRP raid-1 link between them.
Unfortunately, you need 500K+ random write IOPS to pull this off, which
I think is impossible with stock parity raid, and very hard with
raid-10.

> My suspicion is that the actual load is made up of rather small random
> read/write, because that is the scenario that produced the worst performance
> results when I was initially setting this up, and seems to be what we are
> getting in practice.
>
> The last option is, what if I moved to RAID10? Would that provide a
> significant performance boost (completely removes the need to worry about
> chunk/stripe size because we always just write the exact data we want, no
> need to read/compute/write)?

RAID-10 will be faster, but you pay for this with capacity. It is also
a double-edged sword, as SSDs themselves run faster if you leave more
free space on them, so RAID-10 might not be a lot faster than RAID-5
with some space left over. Also remember that free space on the SSDs
only counts if it is actually unallocated. So you need to trim the SSDs
or start with a secure erased drive and then never use the full
capacity. It is best to leave an empty partition that is untouched.

> OR, is that read/compute overhead negligible since I'm using SSD and read
> performance is so quick?

The reads, especially with the pre-4.4 code or with raid-6, definitely
take their toll. Most SSDs are also not quite symmetrical in terms of
performance. If your SSD does 50K read IOPS and 50K write IOPS, it will
probably not do 25K reads and 25K writes concurrently, but instead stop
somewhere around 18K. But your mileage may vary.

If you have 8 drives that each do 20K IOPS read/write symmetric, with
new raid-5, each 4K write is 2 reads and 2 writes. 8 drives will give
you 8*20K = 160K reads and 160K writes, or 320K total OPS. Each 4K
write takes 4 OPS, so your write rate ends up maxing out at 80K IOPS.
With the old raid-5 logic, you end up with 6 reads plus 2 writes per
"OP", so you tend to max out around 320K/(6+2) = 40K IOPS.
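
The arithmetic above can be sketched in a few lines of Python. This is
only a back-of-the-envelope model, not a benchmark: it assumes each
drive sustains 20K reads and 20K writes per second at the same time,
and it pools reads and writes into one OPS budget the same way the
estimate above does. The function name and parameters are made up for
illustration.

```python
def raid5_write_ceiling_k(drives, per_drive_k, reads_per_write, writes_per_write=2):
    """Rough upper bound on random 4K write IOPS (in thousands) for the array."""
    # Assumption: every drive delivers per_drive_k reads AND per_drive_k
    # writes concurrently, so the array budget is drives * per_drive_k * 2.
    total_ops_k = drives * per_drive_k * 2
    # Each 4K write to the array costs this many drive-level operations.
    ops_per_write = reads_per_write + writes_per_write
    return total_ops_k / ops_per_write


n = 8  # drives in the array

# "New" logic: read old data + old parity, write new data + new parity.
new_logic = raid5_write_ceiling_k(n, 20, reads_per_write=2)

# "Old" logic: read the other n-2 data chunks, write data + parity.
old_logic = raid5_write_ceiling_k(n, 20, reads_per_write=n - 2)

print(f"new raid-5 logic: ~{new_logic:.0f}K IOPS")
print(f"old raid-5 logic: ~{old_logic:.0f}K IOPS")
```

Changing n shows how the old logic gets worse as drives are added,
while the new logic stays at 4 operations per write. It says nothing
about FTL behaviour or the slowest-drive effect.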
With more than 8 drives, these computations tend to fall apart, so 24
SSD arrays are not 3x faster than 8 SSD arrays, at least with stock
code.

You also need to consider what raid does to the SSD FTL. As you chatter
a drive, its wear goes up and its performance goes down. Different SSD
models can vary wildly, but again the rule of thumb is to keep as much
free space as possible on the drives. raid-5 or mirroring is also 2:1
write amplification (i.e., you are writing two drives) and raid-6 is
3:1, on top of whatever the FTL write amplification is at the time.

> For completeness, PV information:
>   PV Name               /dev/md1
>   VG Name               vg0
>   PV Size               3.06 TiB / not usable 2.94 MiB
>   Allocatable           yes
>   PE Size               4.00 MiB
>   Total PE              801253
>   Free PE               33281
>   Allocated PE          767972
>   PV UUID               c0PIEb-tUka-zBk3-lcGM-H89s-ayde-hcMUBZ
>
> Any advice or assistance would be greatly appreciated.
>
> Regards,
> Adam
> --
> Adam Goryachev Website Managers www.websitemanagers.com.au
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Doug Dumitru
WildFire Storage.
http://www.wildfire-storage.com