On 29/07/16 03:20, Peter Grandi wrote:
[ ... ]
* Replace the flash SSDs with those that are known to deliver
high small synchronous write IOPS (at least 10,000,
single-threaded).
Is there a "known" SSD that you would suggest? My problem is
that Intel spec sheets seem to suggest that there is little
performance difference across the range of SSD's, so it's
really not clear which SSD model I should buy.
The links I wrote earlier have lists:
Thanks for reminding me of that. I see that the list reflects my
experience (if we assume the 530 model is equivalent to the 535 model on
the list, and my 520 480GB is equivalent to the 520 on the list).
However, I can't get the budget for those really awesome drives at the
top of the list; that would require around $20k... or more.
For now, I've got 16 x 545s 1TB drives, and have replaced the first half
(ie, all drives in one server). Now I can see that the drives themselves
don't seem to be the bottleneck (the drives don't run at 100% util,
while the DRBD device does run at 100%).
I've written a small script to keep track of the number of seconds each
drive's util value falls into each bracket (increments of 10%). Let me
know if you would like a copy (it's just a perl script which reads
iostat output; I'm sure it could be written much more cleanly).
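The general idea is roughly this (a rough shell/awk equivalent of what
the script does, not the script itself; it assumes %util is the last
column of extended iostat output, which can vary between sysstat
versions):

# sample every second for an hour, then print the number of seconds
# each device spent in each 10% utilisation bracket
iostat -dx 1 3600 | awk '
  $1 ~ /^(sd|md|drbd)/ {
    b = int($NF / 10); if (b > 9) b = 9;  # bucket 0 = 0-10%, ..., 9 = 90-100%
    n[$1, b]++
  }
  END {
    for (k in n) {
      split(k, a, SUBSEP);
      printf "%-8s %3d%% %6d\n", a[1], (a[2]+1)*10, n[k]
    }
  }' | sort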
So far, this is what I get on the secondary (with the new 8 x 845s 1TB
drives):
Drive      10    20    30    40    50    60    70    80    90   100
md1     19265     0     0     0     0     0     0     0     0     0
sda     17029  1579   404   137    49    45    13     4     4     1
sdb     16983  1453   477   179    77    63    22     6     3     2
sdc     16867  1579   492   182    76    40    17     8     1     3
sdd     17043  1499   445   154    59    40    14     6     3     2
sde     17064  1506   415   152    68    32    15     4     6     3
sdf     17138  1467   396   152    46    37    11    10     4     4
sdg     17118  1493   401   139    56    31    14     7     2     4
sdh     16997  1577   407   138    62    45    11    10     7     6
sdi     19236    12     4     4     2     0     2     3     0     1
Hopefully that will line up right!
So, out of the last 19265 seconds, each of the underlying drives was at
100% for only a couple of seconds (sdi is the OS drive). That is, the
last column shows the number of seconds the drive was at 90-100% util
as reported by iostat, the 10 column shows the number of seconds
between 0 and 10%, and so on.
Looking at the primary, with all 520 series drives (except sda, which is
a 545s series) and the DRBD devices, I see this:
Drive       10    20    30    40    50   60   70   80   90  100
drbd0    19971   108    54    36    13    2    0    2    1    1
drbd1    19842   165    77    48    34    4    6    5    3    3
drbd10   19766   279    62    35    23    7    6    4    2    1
drbd11   20081    37    32    21    12    1    3    1    0    0
drbd12   20041    79    38    19     9    1    0    0    1    0
drbd13   16195  2335   758   338   220  131   77   39   32   58
drbd14   19765   230    90    49    30    9    4    6    2    1
drbd15    3473  6323  4136  2250  1390  913  614  443  418  220
drbd17   20175     9     1     0     3    0    0    0    0    0
drbd18   19878   170    65    29    23   10    4    0    6    1
drbd19   19255   368   138    86    87  100   39   35   44   35
drbd2    20188     0     0     0     0    0    0    0    0    0
drbd3    17457  1276   610   316   175  140   66   43   33   56
drbd4    20154    17     6     6     5    0    0    0    0    0
drbd5    19859   141    59    38    26   10    4    5    3   42
drbd6    20112    39    20     9     3    1    1    1    1    0
drbd7    20188     0     0     0     0    0    0    0    0    0
drbd8    19894   136    78    44    22    5    3    2    0    2
drbd9    19476   289   211   123    41   21    9    6    3    7
md1      20188     0     0     0     0    0    0    0    0    0
sda      16948  1696   439   286   213  206  316   81    3    0
sdb      16059  2177   844   402   290  352   50   13    1    0
sdc      16141  2132   852   388   312  328   30    5    0    0
sdd      15914  2182   956   395   300  362   72    6    1    0
sde      16099  2137   801   393   256  366  124   10    1    1
sdf      16000  2169   898   408   322  340   39    9    3    0
sdg      15929  2265   822   418   259  290  195    8    2    0
sdh      16107  2129   822   419   324  337   41    9    0    0
sdi      20155     3     3     7    14    6    0    0    0    0
So on the primary, I see even less of a bottleneck on the underlying
drives, which doesn't make a lot of sense to me. The secondary has less
read load (since all reads are handled by the primary), and should only
need to deal with the RAID read-modify-write (RMW) cycle. Also, I'm not
sure, but I think the secondary does fewer metadata updates for DRBD.
So I can only presume the new drives are much better than the 530
series, but still not as good as the 520 series. I'll need to run some
tests before I put the drives live next time.
However, the notable point is that the DRBD devices show high util
levels much more frequently than the underlying devices, so I can only
assume that the current limitation is caused by DRBD rather than the
drives. Though solving the DRBD issue will probably just push the limit
back to the drives, without a lot of difference. See below for my
(your) ideas on improving both of those things.....
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
http://www.spinics.net/lists/ceph-users/msg25928.html
https://www.redhat.com/en/resources/ceph-pcie-ssd-performance-part-1
As one of those pages says, the Samsung SM863 looks attractive,
but for historical reasons so far I have only seen Intel DCs in
similar use. There are discussions of other models in various posts
related to Ceph journal SSD usage.
Obviously it's not as if I can afford to buy one of each
and test them, either.
In addition to the lists above I have just tested my three
home flash SSDs:
* Micron M4 256GB:
# dd bs=4k count=100000 oflag=direct,dsync if=/dev/zero of=/var/tmp/TEST
100000+0 records in
100000+0 records out
409600000 bytes (410 MB) copied, 1200.3 s, 341 kB/s
* Samsung 850 Pro 256GB:
# dd bs=4k count=100000 oflag=direct,dsync if=/dev/zero of=/var/tmp/TEST
100000+0 records in
100000+0 records out
409600000 bytes (410 MB) copied, 1732.93 s, 236 kB/s
* Hynix SK SH910 256GB:
# dd bs=4k count=100000 oflag=direct,dsync if=/dev/zero of=/var/tmp/TEST
100000+0 records in
100000+0 records out
409600000 bytes (410 MB) copied, 644.742 s, 635 kB/s
So I would not recommend any of them for "small sync writes"
workloads :-), but they are quite good otherwise. I do notice
they are slow on small sync writes when downloading mail, as
each message is duly 'fsync'ed.
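If IOPS figures are more convenient than kB/s, a roughly equivalent
single-threaded test can be done with 'fio'; this is just a sketch,
and the job name and target file are arbitrary:

# 4KiB O_DIRECT + O_SYNC sequential writes, one thread, queue depth 1
fio --name=syncwrite --filename=/var/tmp/TEST --size=400m \
    --rw=write --bs=4k --direct=1 --sync=1 --numjobs=1 --iodepth=1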
BTW, as bonus material, I have done an abbreviated test on the SH910
with block sizes between 4KiB and 1024KiB:
# for N in 4k 16k 64k 128k 256k 512k 1024k; do echo -n "$N: "; dd bs=$N count=1000 oflag=dsync if=/dev/zero of=/var/tmp/TEST |& grep copied; done
4k: 4096000 bytes (4.1 MB) copied, 6.23481 s, 657 kB/s
16k: 16384000 bytes (16 MB) copied, 6.29379 s, 2.6 MB/s
64k: 65536000 bytes (66 MB) copied, 6.09223 s, 10.8 MB/s
128k: 131072000 bytes (131 MB) copied, 6.5487 s, 20.0 MB/s
256k: 262144000 bytes (262 MB) copied, 6.93361 s, 37.8 MB/s
512k: 524288000 bytes (524 MB) copied, 7.73957 s, 67.7 MB/s
1024k: 1048576000 bytes (1.0 GB) copied, 12.8671 s, 81.5 MB/s
Note how the time to write 1000 blocks is essentially the same
between 4KiB and 128KiB (around 6.2-6.5s, i.e. roughly 160 synchronous
writes per second regardless of block size), which is quite amusing.
Probably the flash-page size is around 256KiB.
For additional bonus value, the same on a "fastish" consumer 2TB
disk, a Seagate ST2000DM001:
# for N in 4k 16k 64k 128k 256k 512k 1024k; do echo -n "$N: "; dd bs=$N count=1000 oflag=dsync if=/dev/zero of=/fs/sdb6/tmp/TEST |& grep copied; done
4k: 4096000 bytes (4.1 MB) copied, 44.9177 s, 91.2 kB/s
16k: 16384000 bytes (16 MB) copied, 38.131 s, 430 kB/s
64k: 65536000 bytes (66 MB) copied, 35.8263 s, 1.8 MB/s
128k: 131072000 bytes (131 MB) copied, 35.8188 s, 3.7 MB/s
256k: 262144000 bytes (262 MB) copied, 36.6838 s, 7.1 MB/s
512k: 524288000 bytes (524 MB) copied, 37.0612 s, 14.1 MB/s
1024k: 1048576000 bytes (1.0 GB) copied, 42.0844 s, 24.9 MB/s
Yep, definitely won't be going backwards to spinning disks :)
* Relax the requirement for synchronous writes on *both* the
primary and secondary DRBD servers, if feeling lucky.
I have the following entries for DRBD, which were suggested by
Linbit (and which previously lifted performance from abysmal to
more than sufficient, around 2+ years ago). [ ... ]
That's an inappropriate use of "performance" here:
disk-barrier no;
disk-flushes no;
md-flushes no;
That "feeling lucky" list seems to me to have made performance
lower (in the sense that the performance of writing to
'/dev/null' is zero, even if the speed is really good :->).
With those settings the data sync policy is "disk-drain", which
also involves some waiting, but somewhat dangerous, except "In
case your backing storage device has battery-backed write cache"
(and "device" here means system and host adapter and disk); it
is not clear to me for metadata what "md-flushes no" gives.
BTW, if you have battery-backed everything on the secondary side,
you could use protocol "B".
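For reference, that would be a one-line change in the resource
definition, roughly like this ("r0" standing in for whatever the
resource is actually called):

resource r0 {
    net {
        protocol B;   # memory-synchronous: a write is considered
                      # complete once it has reached the peer's
                      # buffer cache, not the peer's disk
    }
}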
From my understanding, the times these settings can cause a problem are:
1) When both servers hard power off - possibly some of the latest data
the VMs expect to be on disk has not actually been written. If this is
the case, all the VMs were also hard powered off, and so the VMs have
no idea what they expect to be written or not. The end user may need to
redo some work/etc, but that is acceptable. Worst case scenario, a DB
file is corrupted and needs to be restored from the previous night's
backup, and users must redo all their work, which is also "acceptable"
(from a risk point of view).
2) One server hard powers off, perhaps from a power supply failure/etc -
when it powers on again, it should re-sync with the DRBD primary, and
potentially we do a DRBD verify to confirm everything is good. As long
as there is no failure on the primary, then everything is good. Worst
case, there is a catastrophic failure of the primary before the verify
is complete, or before the secondary comes on-line again, and basically
we treat it as above.
We can't deal with every possible scenario, as the cost is prohibitive;
we can only deal with the more common scenarios, and those that are
cheaper to deal with: eg, all equipment is protected by UPS, we use
redundant RAID instead of linear/striping, and we use DRBD for
replication. The most likely failures are disk, power supply, or network
cables (ie, unplugged by accident/etc), and this setup protects well
against all three of those.
However, given those numbers it looks likely that the bottleneck is also
on the primary DRBD side.
Do you have any other suggestions or ideas that might assist?
* Smaller RAID5 stripes, as in 4+1 or 2+1, are cheaper in space
than RAID10 and enormously raise the chances that a full
stripe-write can happen (it still has the write-hole problem
of parity RAID).
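For example (assuming, say, a 64KiB md chunk size, which may well not
be what is in use here): a 4+1 set has a full data stripe of only
4 x 64KiB = 256KiB, so a 256KiB-aligned write can be done as a single
full-stripe write, whereas an 8+1 set needs 512KiB of contiguous data
before md can avoid the read-modify-write cycle.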
I was planning to upgrade to the 4.4.x kernel, which would kind of solve
this, since it will only read from 2 drives anyway, but it turns out
that is more difficult than I expected. (The iscsitarget kernel module
doesn't compile cleanly with the new kernel, and it doesn't seem to be
well supported on such recent kernel versions. I'll probably wait
until Debian testing becomes stable, or at least a lot closer, before
going down that path.)
I could potentially move to 2 x RAID5 arrays of 3+1 and then concatenate
(linear) or stripe those, which means I only lose one extra disk of
capacity.... Will need to think about that further...
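Just thinking out loud, the layout would be something like this
(purely illustrative; the device names are placeholders, not my
actual ones):

# two 3+1 RAID5 sets...
mdadm --create /dev/md10 --level=5 --raid-devices=4 /dev/sd[abcd]1
mdadm --create /dev/md11 --level=5 --raid-devices=4 /dev/sd[efgh]1
# ...then either striped (level 0) or concatenated (level linear) on top
mdadm --create /dev/md12 --level=0 --raid-devices=2 /dev/md10 /dev/md11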
* Make sure the DRBD journal is also on a separate device that
allows fast small sync writes.
I think this would be the next option to investigate. Currently the DRBD
journal (metadata) is on the same devices as the data.
Reading from:
http://www.drbd.org/en/doc/users-guide-84/ch-internals#s-internal-meta-data
"Advantage. For some write operations, using external meta data
produces a somewhat improved latency behavior."
Do you have any more knowledge on the expected performance advantage?
ie, would half the writes move from the data drive to the meta data drive?
I'm thinking it might be plausible to purchase 2 x Intel P3700 400GB and
put one in each DRBD server for the metadata updates. Although if this
isn't going to make much difference (eg, only 20%) then it is less
likely to be worthwhile...
Can anyone suggest what kind of performance improvement I might see by
doing this?
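For what it's worth, my understanding is that the change itself would
just be pointing meta-disk at the new device in each resource,
something like this (host name, device names and the index are
illustrative only):

resource r0 {
  on san1 {
    device    /dev/drbd0;
    disk      /dev/md1p1;
    meta-disk /dev/nvme0n1p1[0];   # external DRBD metadata on the P3700
    address   10.0.0.1:7789;
  }
}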
The alternative (for double the cost plus a bit more) would be to
migrate from RAID5 to RAID10 - is that likely to produce a better or
worse result? 2 x P3700 400GB is probably around $2500, while 12 x 545s
1000GB is around $4800, but that would need another SATA controller
card, which probably means changing the motherboard/CPU/etc as well, so
it becomes a lot more....
Also, I have appended a sample DRBD configuration I have used:
----------------------------------------------------------------
# http://article.gmane.org/gmane.linux.network.drbd/18348
# http://www.drbd.org/users-guide-8.3/s-throughput-tuning.html
# https://alteeve.ca/w/AN!Cluster_Tutorial_2_-_Performance_Tuning
# http://fghaas.wordpress.com/2007/06/22/performance-tuning-drbd-setups/
sndbuf-size 0;
rcvbuf-size 0;
max-buffers 16384;
unplug-watermark 16384;
max-epoch-size 16384;
I have similar values, but will need to investigate the above options
further. rcvbuf-size doesn't seem to be well documented, at least in the
DRBD 8.4 manual, but I will research these some more. Then I will also
need to check how to modify the values without causing a system
meltdown....
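(From what I have read so far, the usual way to apply such changes to a
running resource appears to be drbdadm adjust after editing the config
on both nodes, e.g.:)

# after editing /etc/drbd.d/r0.res identically on both nodes
drbdadm dump r0     # sanity-check that the config parses as expected
drbdadm adjust r0   # apply the changed net/disk options to the running resource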
Thanks again for your advice/information, it is very helpful.
Regards,
Adam
--
Adam Goryachev Website Managers www.websitemanagers.com.au