Re: Intel SSD or other brands

Doug Dumitru <doug@xxxxxxxxxx> · Thu, 29 Dec 2016 10:50:42 -0800

Mr. Goryachev,

I find it easier to look at these numbers in terms of IOPS.  You are
dealing with 17,500 IOPS vs 100 IOPS.  This pretty much has to be the
"commit" from the drive.  The benchmark is basically waiting for the
data to actually hit "recorded media" before the IO completes.  The
"better" drive is returning this "write ACK" when the data is in RAM,
while the "worse" drive is returning this "write ACK" after the data is
somewhere much slower (probably flash).  I would note that 17,500 IOPS
is a "good" but not "great" number.
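
For anyone who wants to redo the conversion, here is a minimal sketch
of the arithmetic.  It assumes fio's kB is 1024 bytes and the block
size is the 4k from your command line; the function names are just
mine for illustration, not anything fio provides.

```python
BLOCK_SIZE_KB = 4  # fio was run with --bs=4k

def kbps_to_iops(kb_per_sec, block_kb=BLOCK_SIZE_KB):
    """Convert fio's reported bandwidth (kB/s) into IOs per second."""
    return kb_per_sec / block_kb

def q1_latency_us(iops):
    """At iodepth=1 only one IO is in flight, so latency is 1/IOPS."""
    return 1_000_000 / iops

# Bandwidth figures taken from the summary table in the quoted message.
for name, kbps in [("520s", 70217), ("530s", 391)]:
    iops = kbps_to_iops(kbps)
    print(f"{name}: {iops:8.0f} IOPS, {q1_latency_us(iops):8.0f} us per write")
```

That works out to roughly 57 usec per write on the 520s versus about
10 msec on the 530s, which is why this looks like a commit-to-media
wait and not a bandwidth limit.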

Doing commit writes to Flash is expensive.  Not only do you have to
wait for the flash, but you have to update the mapping tables to get
to the data.  Flash also does not typically allow 4K updates (even
given the erase rules), so your 4K sync update probably has to update
a 16K "page", which is probably causing a lot of flash wear.  Maybe as
much as 10:1 write amplification.  Maybe more.
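
A rough sketch of where that ratio could come from, under stated
assumptions: the 16K page is my guess from above, and the idea that
each commit also persists one page of mapping table is purely an
assumption for illustration.  Real drives differ, and garbage
collection would add to whatever this gives.

```python
def write_amplification(host_bytes, flash_bytes):
    """Ratio of bytes programmed to flash over bytes the host wrote."""
    return flash_bytes / host_bytes

HOST_WRITE = 4 * 1024    # one 4K sync write from the benchmark
FLASH_PAGE = 16 * 1024   # 16K page size (a guess, not a datasheet value)

# Data page only: a whole 16K page is programmed for a 4K update.
data_only = write_amplification(HOST_WRITE, FLASH_PAGE)

# ASSUMPTION: each commit also persists one 16K page of mapping table.
with_mapping = write_amplification(HOST_WRITE, 2 * FLASH_PAGE)

print(data_only, with_mapping)
```

So the page size alone gives 4:1, and one mapping page per commit
would make it 8:1, before any garbage collection.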

There are a bunch of things to consider when looking at "sync"
performance.  The easiest way to look at this is that the drive
"absolutely" has to have the data in stable storage to be "correct".
This is not really true, and the overhead of this can be huge.  File
systems "know" this behavior and instead of looking for a hard sync,
they use "barriers".  The idea of a barrier is that the drive is
allowed to buffer writes, just not re-order them so that an IO
crosses a "barrier".

Testing SSDs for this means looking for "serialization errors".  If
you pull power from an SSD and then go look at the blocks that made it
to the media after the reboot, drives can work in one of three ways.
If absolutely every ACKd block is on the drive, then sync works and
barriers are not relevant.  If the writes stop and no "newer" write
made it to the drive when an "older" one did not, then the drive is
still OK with barriers.  If "newer" writes made it to the media but
older writes did not, then this is a serialization error and you have
spaghetti.  SSDs with power fail serialization errors are "bad".  Then
again, it is important to understand the system-level implications of
how the error will impact your stack.
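
The three outcomes can be sketched as a small classifier.  This is a
simplified illustration, not a real test harness: the function name
and the labels are hypothetical, and it treats every write as strictly
ordered, which is stricter than what a file system using barriers
actually needs.

```python
def classify_power_fail(acked, on_media):
    """Classify one power-pull result.

    acked    -- write IDs in the order the drive ACKed them (oldest first)
    on_media -- set of write IDs found on the drive after reboot
    """
    present = [w in on_media for w in acked]
    if all(present):
        return "sync-safe"            # every ACKed block survived
    first_missing = present.index(False)
    if any(present[first_missing:]):
        return "serialization-error"  # a newer write survived an older loss
    return "barrier-safe"             # only the tail of the stream was lost

print(classify_power_fail([1, 2, 3, 4], {1, 2, 3, 4}))
print(classify_power_fail([1, 2, 3, 4], {1, 2}))
print(classify_power_fail([1, 2, 3, 4], {1, 2, 4}))
```

The third call is the "spaghetti" case: write 4 is on the media but
write 3, which was ACKd before it, is not.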

In the case of RAID-5 in a "traditional" Linux deployment, and
especially with DRBD protecting you on another node, you are probably
fine without having every last ACK "perfect".  After all, if you power
fail the primary node, the secondary will become the primary, and any
"tail writes" that are missing will get re-sync'd by DRBD's hash
checks.  And because the amount of data being re-synced is small, it
will happen very quickly and you might not even notice it.

Back to performance, you should also consider what your array is doing
to you.  You are running an 8 drive raid-5 array.  This will limit
performance even more because every write becomes 2 sync writes, plus
6 reads.  With q=1 latencies, if you run this test on the array with
"good" drives, you should probably get about 15K IOPS max, but it
might be a bit worse as the read and write latencies add for each OP.
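
Here is a back-of-envelope sketch of that estimate.  It assumes parity
is rebuilt by reading the untouched data chunks, that the reads run in
parallel and the writes run in parallel after them, and it uses a
made-up 10 usec read latency as a placeholder.  None of this is
measured, and the function names are only for illustration.

```python
def raid5_small_write_ios(n_drives):
    """IOs for one sub-stripe write when parity is rebuilt from the
    untouched data chunks (reconstruct-write)."""
    reads = n_drives - 2   # every data chunk except the one being written
    writes = 2             # the new data chunk plus the new parity chunk
    return reads, writes

def q1_array_iops(read_lat_us, write_lat_us):
    """Upper bound at iodepth=1: one OP costs read + write latency."""
    return 1_000_000 / (read_lat_us + write_lat_us)

reads, writes = raid5_small_write_ios(8)
print(reads, writes)

# 57 usec is 1/17,500 IOPS; the 10 usec read latency is a placeholder.
print(round(q1_array_iops(10, 57)))
```

With slower reads the ceiling drops accordingly, which is the "might
be a bit worse" part.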

I tried your test on our "in house" "server-side FTL" mapping layer on
8 drives raid-5.  This is an E5-1650 v3 w/ an LSI 3008 SAS controller
and 8 Samsung 256GB 850 Pro SSDs.  The array is "new" so it will slow
down somewhat as it fills.  439K IOPS is actually quite a bit under
the array's bandwidth, but at q=1, you end up benchmarking the
benchmark program.  (At q=10, the array saturates the drives' linear
performance at about 900K IOPS or 3518 MB/sec.)

root@ubuntu-16-24-2:/usr/local/ess# fio --filename=/dev/mapper/ess-md0 \
    --direct=1 --sync=1 --rw=write --bs=4k --iodepth=1 --runtime=60 \
    --time_based --group_reporting --name=ess-raid5 --numjobs=1
ess-raid5: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.2.10
Starting 1 process
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/1714MB/0KB /s] [0/439K/0 iops] [eta 00m:00s]
ess-raid5: (groupid=0, jobs=1): err= 0: pid=29544: Thu Dec 29 10:36:13 2016
  write: io=102653MB, bw=1710.9MB/s, iops=437980, runt= 60001msec
    clat (usec): min=1, max=155, avg= 2.06, stdev= 0.49
     lat (usec): min=1, max=155, avg= 2.10, stdev= 0.49
    clat percentiles (usec):
     |  1.00th=[    1],  5.00th=[    1], 10.00th=[    2], 20.00th=[    2],
     | 30.00th=[    2], 40.00th=[    2], 50.00th=[    2], 60.00th=[    2],
     | 70.00th=[    2], 80.00th=[    2], 90.00th=[    3], 95.00th=[    3],
     | 99.00th=[    3], 99.50th=[    3], 99.90th=[    7], 99.95th=[    8],
     | 99.99th=[   11]
    bw (MB  /s): min= 1330, max= 1755, per=100.00%, avg=1710.84, stdev=39.12
    lat (usec) : 2=5.17%, 4=94.51%, 10=0.31%, 20=0.02%, 50=0.01%
    lat (usec) : 250=0.01%
  cpu          : usr=14.08%, sys=85.92%, ctx=47, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=26279247/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=102653MB, aggrb=1710.9MB/s, minb=1710.9MB/s, maxb=1710.9MB/s, mint=60001msec, maxt=60001msec

On Wed, Dec 28, 2016 at 6:14 PM, Adam Goryachev
<mailinglists@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> Apologies for my prematurely sent email (if it gets through), this one is
> complete...
>
> Hi all,
>
> I've spent a number of years trying to build up a nice RAID array for my
> SAN, but I seem to be slowly solving one bottleneck only to find another
> one. Right now, I've identified the underlying SSDs as being a major factor
> in that performance issue.
>
> I started with 5 x 480GB Intel 520s SSDs in a RAID5 array, and this
> performed really well.
> I added 1 x 480GB 530s SSD
> I added 2 x 480GB 530s SSD
>
> I now found out that performance of a 520s SSD is around 180 times faster
> than a 530s SSD. I had to run many tests, but eventually I found the right
> things to test for (which matched my real life results), and the numbers
> were nothing short of crazy.
> Running each test 5 times and averaging the results...
> 520s: 70MB/s
> 530s: 0.4MB/s
>
> OK, so before I could remove and test the 520s, I removed/tested one of the
> 530s and saw the horrible performance, so I bought and tested a 540s and
> found:
> 540s: 6.7MB/s
> So, around 20 times better than the 530, so I replaced all the drives with
> the 540, but I still have worse performance than the original 5 x 520s
> array.
>
> Working with Intel, they swapped a 530s drive for a DC3510, and I then found
> the DC3510 was awesome:
> DC3510: 99MB/s
> Except, a few weeks back when I placed the order, I was told there is no
> longer any stock of this drive, (I wanted 16 x 800GB model), and that the
> replacement model is the DC3520. So I figure I won't just blindly buy the
> DC3520 assuming its performance will be similar to the previous model, so I
> buy 4 x 480GB DC3520 and start testing.
> DC3520: 37MB/s
>
> So, 1/3rd of a DC3510, but still better than the current live 540s drives,
> but also still half the original 520s drives.
>
> Summary:
> 520s:   70217kB/s
> 530s:     391kB/s
> 540s:    6712kB/s
> 330s:      24kB/s
> DC3510: 99313kB/s
> DC3520: 37051kB/s
> WD2TBCB:  475kB/s
>
> * For comparison, I had an older Western Digital Black 2TB spare, and ran the
> same test on it. Got a better result than some of the SSDs, which was really
> surprising, but it's certainly not an option.
> FYI, the test I'm running is this:
> fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k --iodepth=1
> --runtime=60 --time_based --group_reporting --name=IntelDC3510_4kj1
> --numjobs=1
> All drives were tested on the same machine/SATA3 port (basic intel desktop
> motherboard), with nothing on the drive (no fs, no partition, nothing trying
> to access it, etc..).
> In reality, I tested iodepth from 1..10, but in my use case, the iodepth=1
> result is the relevant number. At higher iodepth, we see performance on all
> the drives improve; if interested, I can provide a full set of my
> results/analysis.
>
> So, my actual question... Can you suggest or have you tested any Intel (or
> other brand) SSD which has good performance (similar to the DC3510 or the
> 520s)? (I can't buy and test every single variant out there, my budget
> doesn't go anywhere close to that).
> It needs to be SATA, since I don't have enough PCIe slots to get the needed
> capacity (nor enough budget). I need around 8 x drives with around 6TB
> capacity in RAID5.
>
> FYI, my storage stack is like this:
> 8 x SSD's
> mdadm - RAID5
> LVM
> DRBD
> iSCSI
>
> From my understanding, it is DRBD that makes everything an iodepth=1 issue.
> It is possible to reach iodepth=2 if I have 2 x VMs both doing a lot of IO
> at the same time, but it is usually single-VM performance that is too
> limited.
>
> Regards,
> Adam
>
>
> --
> Adam Goryachev Website Managers www.websitemanagers.com.au
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Doug Dumitru
EasyCo LLC