Re: RAID5 Performance

On 29/07/16 03:20, Peter Grandi wrote:
[ ... ]

* Replace the flash SSDs with those that are known to deliver
   high (at least > 10,000 single threaded) small synchronous
   write IOPS.
Is there a "known" SSD that you would suggest? My problem is
that Intel spec sheets seem to suggest that there is little
performance difference across the range of SSDs, so it's
really not clear which SSD model I should buy.
The links I wrote earlier have lists:
Thanks for reminding me of that. I see that the list reflects my experience (if we assume the 530 model is equivalent to the 535 model on the list, and my 520 480GB is equivalent to the 520 on the list).

However, I can't get the budget for those really awesome drives at the top of the list; that would require around $20k... or more.

For now, I've got 16 x 545s 1TB drives, and have replaced the first half (ie, all drives in one server). Now I can see that the drives themselves don't seem to be the bottleneck (the drives don't run at 100% util, while the DRBD device does run at 100%).

I've written a small script to keep track of the number of seconds each drive util value fits into each bracket (increments of 10%). Let me know if you would like a copy (it's just a perl script which reads iostat output, I'm sure it could be written much nicer). So far, this is what I get on the secondary (with the new 8 x 845s 1TB drives):

Drive       10    20    30    40    50    60    70    80    90   100
md1      19265     0     0     0     0     0     0     0     0     0
sda      17029  1579   404   137    49    45    13     4     4     1
sdb      16983  1453   477   179    77    63    22     6     3     2
sdc      16867  1579   492   182    76    40    17     8     1     3
sdd      17043  1499   445   154    59    40    14     6     3     2
sde      17064  1506   415   152    68    32    15     4     6     3
sdf      17138  1467   396   152    46    37    11    10     4     4
sdg      17118  1493   401   139    56    31    14     7     2     4
sdh      16997  1577   407   138    62    45    11    10     7     6
sdi      19236    12     4     4     2     0     2     3     0     1

Hopefully that will line up right!
So, out of the last 19265 seconds, each of the underlying drives was in the 90-100% util bracket for only a handful of seconds (sdi is the OS drive). I.e., the last column shows the number of seconds the drive was at 90 to 100% util as reported by iostat, the 10 column shows the number of seconds between 0 and 10%, and so on.
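
For what it's worth, the basic approach is just this (a simplified sketch rather than the actual script; it assumes "iostat -dxk 1" output with %util as the last column, and the bracket/output layout is only illustrative):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Count, per device, how many one-second iostat samples fall into each
    # 10% utilisation bracket (0-10%, 10-20%, ... 90-100%).
    my %bucket;

    # On Ctrl-C, print one row per device: bracket upper bounds, then counts.
    $SIG{INT} = sub {
        printf "%-8s%s\n", 'Drive',
            join('', map { sprintf '%7d', ($_ + 1) * 10 } 0 .. 9);
        for my $dev (sort keys %bucket) {
            printf "%-8s%s\n", $dev,
                join('', map { sprintf '%7d', $bucket{$dev}[$_] // 0 } 0 .. 9);
        }
        exit 0;
    };

    open my $iostat, '-|', 'iostat', '-dxk', '1'
        or die "cannot run iostat: $!";
    while (my $line = <$iostat>) {
        # Device lines end with the numeric %util value; the header, banner
        # and blank lines don't, so they simply fail to match.
        next unless $line =~ /^(\S+)\s.*\s([\d.]+)\s*$/;
        my ($dev, $util) = ($1, $2);
        my $idx = int($util / 10);
        $idx = 9 if $idx > 9;    # exactly 100% counts in the 90-100% bracket
        $bucket{$dev}[$idx]++;
    }

It just runs until interrupted and dumps the per-device bracket counts on Ctrl-C.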

Looking at the primary, with all 520 series drives (except sda, which is a 545s series) plus the DRBD devices, I see this:

Drive       10    20    30    40    50    60    70    80    90   100
drbd0    19971   108    54    36    13     2     0     2     1     1
drbd1    19842   165    77    48    34     4     6     5     3     3
drbd10   19766   279    62    35    23     7     6     4     2     1
drbd11   20081    37    32    21    12     1     3     1     0     0
drbd12   20041    79    38    19     9     1     0     0     1     0
drbd13   16195  2335   758   338   220   131    77    39    32    58
drbd14   19765   230    90    49    30     9     4     6     2     1
drbd15    3473  6323  4136  2250  1390   913   614   443   418   220
drbd17   20175     9     1     0     3     0     0     0     0     0
drbd18   19878   170    65    29    23    10     4     0     6     1
drbd19   19255   368   138    86    87   100    39    35    44    35
drbd2    20188     0     0     0     0     0     0     0     0     0
drbd3    17457  1276   610   316   175   140    66    43    33    56
drbd4    20154    17     6     6     5     0     0     0     0     0
drbd5    19859   141    59    38    26    10     4     5     3    42
drbd6    20112    39    20     9     3     1     1     1     1     0
drbd7    20188     0     0     0     0     0     0     0     0     0
drbd8    19894   136    78    44    22     5     3     2     0     2
drbd9    19476   289   211   123    41    21     9     6     3     7
md1      20188     0     0     0     0     0     0     0     0     0
sda      16948  1696   439   286   213   206   316    81     3     0
sdb      16059  2177   844   402   290   352    50    13     1     0
sdc      16141  2132   852   388   312   328    30     5     0     0
sdd      15914  2182   956   395   300   362    72     6     1     0
sde      16099  2137   801   393   256   366   124    10     1     1
sdf      16000  2169   898   408   322   340    39     9     3     0
sdg      15929  2265   822   418   259   290   195     8     2     0
sdh      16107  2129   822   419   324   337    41     9     0     0
sdi      20155     3     3     7    14     6     0     0     0     0

So on the primary, I see even less of a bottleneck on the underlying drives, which doesn't make a lot of sense to me. The secondary has less read load (since all reads are handled by the primary), and should only need to deal with the RAID read-modify-write cycles. Also, I'm not sure, but I think the secondary does fewer DRBD metadata updates. So I can only presume the new drives are much better than the 530 series, but still not as good as the 520 series. I'll need to run some tests before I put the drives live next time.

However, the point to note is that the DRBD devices show high util levels much more frequently than the underlying devices, so I can only assume that the current limitation is caused by DRBD rather than the drives. Though solving the DRBD issue will probably just shift the limit back to the drives, without a lot of difference. See below for my (your) ideas on improving both of those things.....

   https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
   http://www.spinics.net/lists/ceph-users/msg25928.html
   https://www.redhat.com/en/resources/ceph-pcie-ssd-performance-part-1
As one of those pages says the Samsung SM863 looks attractive,
but for historical reasons so far I have only seen Intel DCs in
similar use. There are discussions of other models in various posts
related to Ceph journal SSD usage.

Obviously it's not as though I can afford to buy one of each
and test them either.
In addition to the lists above I have just tested my three
home flash SSDs:

* Micron M4 256GB:
     #  dd bs=4k count=100000 oflag=direct,dsync if=/dev/zero of=/var/tmp/TEST
     100000+0 records in
     100000+0 records out
     409600000 bytes (410 MB) copied, 1200.3 s, 341 kB/s
* Samsung 850 Pro 256GB:
     #  dd bs=4k count=100000 oflag=direct,dsync if=/dev/zero of=/var/tmp/TEST
     100000+0 records in
     100000+0 records out
     409600000 bytes (410 MB) copied, 1732.93 s, 236 kB/s
* Hynix SK SH910 256GB:
     #  dd bs=4k count=100000 oflag=direct,dsync if=/dev/zero of=/var/tmp/TEST
     100000+0 records in
     100000+0 records out
     409600000 bytes (410 MB) copied, 644.742 s, 635 kB/s

So I would not recommend any of them for "small sync writes"
workloads :-), but they are quite good otherwise. I do notice
they are slow on small sync writes when downloading mail, as
each message is duly 'fsync'ed.

BTW as bonus material, I have done on the SH910 an abbreviated
test with block sizes between 4KiB and 1024KiB:

   #  for N in 4k 16k 64k 128k 256k 512k 1024k; do echo -n "$N: "; dd bs=$N count=1000 oflag=dsync if=/dev/zero of=/var/tmp/TEST |& grep copied; done
   4k: 4096000 bytes (4.1 MB) copied, 6.23481 s, 657 kB/s
   16k: 16384000 bytes (16 MB) copied, 6.29379 s, 2.6 MB/s
   64k: 65536000 bytes (66 MB) copied, 6.09223 s, 10.8 MB/s
   128k: 131072000 bytes (131 MB) copied, 6.5487 s, 20.0 MB/s
   256k: 262144000 bytes (262 MB) copied, 6.93361 s, 37.8 MB/s
   512k: 524288000 bytes (524 MB) copied, 7.73957 s, 67.7 MB/s
   1024k: 1048576000 bytes (1.0 GB) copied, 12.8671 s, 81.5 MB/s

Note how the time to write 1000 blocks is essentially the same
between 4KiB and 128KiB, which is quite amusing. Probably the
flash-page size is around 256KiB.

For additional bonus value the same on a "fastish" consumer 2TB
disk, a Seagate ST2000DM001:

   #  for N in 4k 16k 64k 128k 256k 512k 1024k; do echo -n "$N: "; dd bs=$N count=1000 oflag=dsync if=/dev/zero of=/fs/sdb6/tmp/TEST |& grep copied; done
   4k: 4096000 bytes (4.1 MB) copied, 44.9177 s, 91.2 kB/s
   16k: 16384000 bytes (16 MB) copied, 38.131 s, 430 kB/s
   64k: 65536000 bytes (66 MB) copied, 35.8263 s, 1.8 MB/s
   128k: 131072000 bytes (131 MB) copied, 35.8188 s, 3.7 MB/s
   256k: 262144000 bytes (262 MB) copied, 36.6838 s, 7.1 MB/s
   512k: 524288000 bytes (524 MB) copied, 37.0612 s, 14.1 MB/s
   1024k: 1048576000 bytes (1.0 GB) copied, 42.0844 s, 24.9 MB/s


Yep, definitely won't be going backwards to spinning disks :)

* Relax the requirement for synchronous writes on *both* the
   primary and secondary DRBD servers, if feeling lucky.
I have the following entries for DRBD which were suggested by
linbit (which previously lifted performance from abysmal to
more than sufficient around 2+ years ago). [ ... ]
That's an inappropriate use of "performance" here:

          disk-barrier no;
          disk-flushes no;
          md-flushes no;
That "feeling lucky" list seems to me to have made performance
lower (in the sense that the performance of writing to
'/dev/null' is zero, even if the speed is really good :->).

With those settings the data sync policy is "disk-drain", which
also involves some waiting, but is somewhat dangerous, except "In
case your backing storage device has battery-backed write cache"
(and "device" here means system and host adapter and disk); it
is not clear to me for metadata what "md-flushes no" gives.

BTW if you have battery-backed everything on the secondary side
you could use protocol "B".
From my understanding, the times these settings can cause a problem are:

1) Both servers hard power off - possibly some of the latest data the VMs expect to be on disk has not been written. If this happens, all the VMs were also hard powered off, so each VM has no idea what it should expect to have been written or not. The end user may need to redo some work etc., but that is acceptable. Worst case scenario, a DB file is corrupted and needs to be restored from the previous night's backup, and users must redo all work, which is also "acceptable" (from a risk point of view).

2) One server hard powers off, perhaps a power supply failure etc. - when it powers on again, it should re-sync with the DRBD primary, and potentially we do a DRBD verify to confirm everything is good. As long as there is no failure on the primary, then everything is good. Worst case, catastrophic failure of the primary before the verify is complete, or before the secondary comes online again, and basically we treat it as above.

We can't deal with every possible scenario, as the cost is prohibitive; we can only deal with the more common scenarios, and those that are cheaper to deal with. E.g., all equipment is protected by UPS, we use redundant RAID instead of linear/striping, and we use DRBD for replication. The most likely failures are disk, power supply, or network cables (i.e., unplugged by accident etc.), and this setup protects well against all three of those.
However given those it looks likely that the bottleneck is also
on the primary DRBD side.

Do you have any other suggestions or ideas that might assist?
* Smaller RAID5 stripes, as in 4+1 or 2+1, are cheaper in space
   than RAID10 and enormously raise the chances that a full
   stripe-write can happen (it still has the write-hole problem
   of parity RAID).
I was planning to upgrade to the 4.4.x kernel, which would kind of solve this, since it will only read from 2 drives anyway, but it turns out that is more difficult than I expected. (The iscsitarget kernel module doesn't compile cleanly with the new kernel, and it doesn't seem to be well supported on such recent kernel versions. I'll probably wait until Debian testing becomes stable, or at least a lot closer, before going down that path.)

I could potentially move to 2 x RAID5 with 3+1 and then linear or stripe those, which means I only lose one extra disk of capacity.... Will need to think about that further...
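
If I do go down that path, my understanding is it would just be a stacked md setup along these lines (a sketch only; the device names and md numbers are placeholders, not my actual layout):

     #  mdadm --create /dev/md2 --level=5 --raid-devices=4 /dev/sd[a-d]
     #  mdadm --create /dev/md3 --level=5 --raid-devices=4 /dev/sd[e-h]
     #  mdadm --create /dev/md4 --level=0 --raid-devices=2 /dev/md2 /dev/md3

(Or --level=linear on the last one for a simple concatenation of the two RAID5 arrays instead of striping across them.)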

* Make sure the DRBD journal is also on a separate device that
   allows fast small sync writes.

I think this would be the next option to investigate. Currently the DRBD journal is on the same devices. Reading from: http://www.drbd.org/en/doc/users-guide-84/ch-internals#s-internal-meta-data

"Advantage. For some write operations, using external meta data produces a somewhat improved latency behavior."

Do you have any more knowledge on the expected performance advantage? I.e., would half the writes move from the data drive to the metadata drive? I'm thinking it might be plausible to purchase 2 x Intel P3700 400GB and put one in each DRBD server for the metadata updates. Although if this isn't going to make much difference (eg, only 20%) then it is less likely to be worthwhile... Can anyone suggest what kind of performance improvement I might see by doing this? The alternative (for double the cost + a bit more) would be to migrate from RAID5 to RAID10; is that likely to produce a better or worse result?
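
For reference, as far as I can tell from the users' guide the change itself is just pointing meta-disk at the dedicated device in each host section, something like this (the host name, device names and address below are placeholders only, not my actual config):

     resource r0 {
         ...
         on san1 {
             device     /dev/drbd0;
             disk       /dev/md1;
             meta-disk  /dev/nvme0n1p1;   # e.g. a partition on the P3700, instead of "internal"
             address    10.0.0.1:7789;
         }
         ...
     }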

2 x P3700 400GB is probably around $2500, while 12 x 545s 1000GB is around $4800, but would need to add another SATA controller card, which probably means changing motherboard/CPU/etc as well, so that becomes a lot more....

Also, I have appended a sample DRBD configuration I have used:

----------------------------------------------------------------

     # http://article.gmane.org/gmane.linux.network.drbd/18348
     # http://www.drbd.org/users-guide-8.3/s-throughput-tuning.html
     # https://alteeve.ca/w/AN!Cluster_Tutorial_2_-_Performance_Tuning
     # http://fghaas.wordpress.com/2007/06/22/performance-tuning-drbd-setups/
     sndbuf-size		    0;
     rcvbuf-size		    0;
     max-buffers		    16384;
     unplug-watermark	    16384;
     max-epoch-size	    16384;
I have similar values, but will need to investigate the above options further. rcvbuf-size doesn't seem to be well documented, at least in the DRBD 8.4 manual, but I will research these some more. Then I will also need to check how to modify the values without causing a system meltdown....
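
From what I can see, the usual way to apply this sort of change is to edit the config on both nodes and then re-apply it to the running resource with drbdadm adjust (sketch only; the resource name is a placeholder, and I'd obviously try it on the secondary first):

     #  drbdadm adjust r0         # on each node, after updating the config
     #  drbdsetup show r0         # check which values are actually in effect (DRBD 8.4 syntax)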

Thanks again for your advice/information, it is very helpful.

Regards,
Adam



--
Adam Goryachev Website Managers www.websitemanagers.com.au
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


