[ ... ]

>> * Replace the flash SSDs with those that are known to deliver
>>   high (at least > 10,000 single threaded) small synchronous
>>   write IOPS.

> Is there a "known" SSD that you would suggest? My problem is
> that Intel spec sheets seem to suggest that there is little
> performance difference across the range of SSD's, so it's
> really not clear which SSD model I should buy.

The links I wrote earlier have lists:

>>> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>> http://www.spinics.net/lists/ceph-users/msg25928.html
>>> https://www.redhat.com/en/resources/ceph-pcie-ssd-performance-part-1

As one of those pages says, the Samsung SM863 looks attractive, but
for historical reasons so far I have only seen Intel DCs in similar
use. There are discussions of other models in various posts related
to Ceph journal SSD usage.

> Obviously it's not something I can afford to buy one of each
> and test them either.

In addition to the lists above I have just tested my three home
flash SSDs:

* Micron M4 256GB:

  # dd bs=4k count=100000 oflag=direct,dsync if=/dev/zero of=/var/tmp/TEST
  100000+0 records in
  100000+0 records out
  409600000 bytes (410 MB) copied, 1200.3 s, 341 kB/s

* Samsung 850 Pro 256GB:

  # dd bs=4k count=100000 oflag=direct,dsync if=/dev/zero of=/var/tmp/TEST
  100000+0 records in
  100000+0 records out
  409600000 bytes (410 MB) copied, 1732.93 s, 236 kB/s

* Hynix SK SH910 256GB:

  # dd bs=4k count=100000 oflag=direct,dsync if=/dev/zero of=/var/tmp/TEST
  100000+0 records in
  100000+0 records out
  409600000 bytes (410 MB) copied, 644.742 s, 635 kB/s

So I would not recommend any of them for "small sync writes"
workloads :-), but they are quite good otherwise. I do notice they
are slow on small sync writes when downloading mail, as each message
is duly 'fsync'ed.

BTW, as bonus material, I have run on the SH910 an abbreviated test
with block sizes between 4KiB and 1024KiB:

  # for N in 4k 16k 64k 128k 256k 512k 1024k; do echo -n "$N: "; dd bs=$N count=1000 oflag=dsync if=/dev/zero of=/var/tmp/TEST |& grep copied; done
  4k: 4096000 bytes (4.1 MB) copied, 6.23481 s, 657 kB/s
  16k: 16384000 bytes (16 MB) copied, 6.29379 s, 2.6 MB/s
  64k: 65536000 bytes (66 MB) copied, 6.09223 s, 10.8 MB/s
  128k: 131072000 bytes (131 MB) copied, 6.5487 s, 20.0 MB/s
  256k: 262144000 bytes (262 MB) copied, 6.93361 s, 37.8 MB/s
  512k: 524288000 bytes (524 MB) copied, 7.73957 s, 67.7 MB/s
  1024k: 1048576000 bytes (1.0 GB) copied, 12.8671 s, 81.5 MB/s

Note how the time to write 1000 blocks is essentially the same
between 4KiB and 128KiB, which is quite amusing. Probably the flash
erase-block (rather than page) size is around 256KiB.

For additional bonus value, the same on a "fastish" consumer 2TB
disk, a Seagate ST2000DM001:

  # for N in 4k 16k 64k 128k 256k 512k 1024k; do echo -n "$N: "; dd bs=$N count=1000 oflag=dsync if=/dev/zero of=/fs/sdb6/tmp/TEST |& grep copied; done
  4k: 4096000 bytes (4.1 MB) copied, 44.9177 s, 91.2 kB/s
  16k: 16384000 bytes (16 MB) copied, 38.131 s, 430 kB/s
  64k: 65536000 bytes (66 MB) copied, 35.8263 s, 1.8 MB/s
  128k: 131072000 bytes (131 MB) copied, 35.8188 s, 3.7 MB/s
  256k: 262144000 bytes (262 MB) copied, 36.6838 s, 7.1 MB/s
  512k: 524288000 bytes (524 MB) copied, 37.0612 s, 14.1 MB/s
  1024k: 1048576000 bytes (1.0 GB) copied, 42.0844 s, 24.9 MB/s
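The same sort of test can be done with 'fio' instead of 'dd'. A
rough equivalent of the 4KiB 'oflag=direct,dsync' runs above would
be something like this (just a sketch, not run here; the job name,
file size and runtime are arbitrary choices of mine):

  # fio --name=syncwrite --filename=/var/tmp/TEST --size=400m \
      --ioengine=sync --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting

The interesting number in its output is the write IOPS: a device
suitable as a journal should manage the "at least > 10,000 single
threaded" small synchronous write IOPS quoted at the top, while the
three consumer SSDs above only deliver roughly 60-160 of them (their
kB/s figures divided by 4).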
>> * Relax the requirement for synchronous writes on *both* the
>>   primary and secondary DRBD servers, if feeling lucky.

> I have the following entries for DRBD which were suggested by
> linbit (which previously lifted performance from abysmal to
> more than sufficient around 2+ years ago). [ ... ]

That's an inappropriate use of "performance" here:

> disk-barrier no;
> disk-flushes no;
> md-flushes no;

That "feeling lucky" list seems to me to have made performance
*lower*, in the sense that the performance of writing to '/dev/null'
is zero even if the speed is really good :->.

With those settings the data write-ordering method falls back to
'disk-drain', which also involves some waiting, but is somewhat
dangerous, except "In case your backing storage device has
battery-backed write cache" (and "device" here means the whole path:
system, host adapter and disk); it is not clear to me what
"md-flushes no" gives for the metadata. BTW, if you have
battery-backed everything on the secondary side you could use
protocol "B" (a sketch of the relevant fragments is at the end of
this message).

However, given those settings, it looks likely that the bottleneck
is also on the primary DRBD side.

> Do you have any other suggestions or ideas that might assist?

* Smaller RAID5 stripes, as in 4+1 or 2+1, are cheaper in space than
  RAID10 and enormously raise the chances that a full stripe write
  can happen (the write-hole problem of parity RAID remains, though).

* Make sure the DRBD journal (its metadata) is also on a separate
  device that allows fast small sync writes.

Also, I have appended a sample DRBD configuration I have used:

----------------------------------------------------------------
resource r0 {
    device                      /dev/drbd_r0 minor 0;

    # A: "local disk and local TCP send buffer"
    # B: "local disk and remote buffer cache"
    # C: "both local and remote disk"
    protocol                    C;

    net {
        # As mentioned on IRC by a DRBD guy, this is not really a
        # secret, but more a "unique id" that ensures that replicas
        # of different resources don't get accidentally connected.
        # Still to be ABR-ized.
        shared-secret           "xxxxxxxxxxxx";
        cram-hmac-alg           sha1;

        ping-timeout            50;

        after-sb-0pri           discard-zero-changes;
        after-sb-1pri           discard-secondary;
        after-sb-2pri           disconnect;

        # http://article.gmane.org/gmane.linux.network.drbd/18348
        # http://www.drbd.org/users-guide-8.3/s-throughput-tuning.html
        # https://alteeve.ca/w/AN!Cluster_Tutorial_2_-_Performance_Tuning
        # http://fghaas.wordpress.com/2007/06/22/performance-tuning-drbd-setups/
        sndbuf-size             0;
        rcvbuf-size             0;
        max-buffers             16384;
        unplug-watermark        16384;
        max-epoch-size          16384;
    }

    syncer {
        csums-alg               sha1;
        # At 45MB/s it takes about 6 hours per 1TB.
        rate                    95M;
        use-rle;
    }

    startup {
        wfc-timeout             15;
        degr-wfc-timeout        15;
        outdated-wfc-timeout    15;
        # Cannot be an address, must be the output of 'hostname'.
        become-primary-on       host-1;
    }

    on host-1 {
        address                 192.168.1.11:7788;
        disk                    /dev/md2;
        flexible-meta-disk      /dev/local0/r0_md;
    }

    on host-2 {
        address                 192.168.1.12:7788;
        disk                    /dev/md2;
        flexible-meta-disk      /dev/local0/r0_md;
    }
}
----------------------------------------------------------------
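PS: purely to illustrate the protocol "B" remark above, and only for
the case where *every* cache under DRBD on *both* hosts is
non-volatile (battery- or capacitor-backed): the relevant fragments
would look roughly like the following, written in DRBD 8.4-style
syntax (the sample above is 8.3-style); this is a sketch, not a
configuration I have tested:

----------------------------------------------------------------
resource r0 {
    net {
        # "local disk and remote buffer cache": the primary reports
        # a write complete once the peer has received the data,
        # without waiting for the peer's disk.
        protocol                B;
    }

    disk {
        # Only tolerable with non-volatile caches end-to-end
        # (system, host adapter, disk) on both hosts.
        disk-barrier            no;
        disk-flushes            no;
        md-flushes              no;
    }

    [ ... ]
}
----------------------------------------------------------------

With a volatile cache anywhere in the path, protocol "C" with
flushes left enabled remains the safe default.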