Re: RAID5 Performance

On 28/07/2016 22:19, Peter Grandi wrote:
> [ ... ]
>
>>> A very brave configuration, a shining example of the
>>> "syntactic" mindset, according to which any arbitrary
>>> combination of legitimate features must be fine :-).
>>
>> While you may say that this configuration is very "brave", it
>> is actually quite common for VDI "appliance" deployments. [ ... ]
>
> There are a lot of very "brave" sysadms out there, and often I
> have to clean up after them :-).
>
> But then I am one of those boring people who think that «VDI
> "appliance" deployments» are usually a phenomenally bad idea, as
> they require a storage layer that has to cover all possible IO
> workloads optimally, as indeed in:

Could I ask what you would "clean up" in the above system? What layers would you remove/simplify? At its simplest, this system should be able to export a block of disk space which the remote machine can present as a block device to the VM (using Xen). Keep in mind there are actually multiple of these remote machines. The method I've chosen is repeated here:
8 x 480GB Intel SSD (mix of 520 and 530 models)
Linux MD RAID5
LVM2
DRBD (takes an LV from each SAN and joins them together)
iSCSI (exports the block device to the 10Gbps network)
iSCSI (imported on the remote machine via 2 x 1Gbps network)
multipathd (joins the two iSCSI connections together)
Xen

Also, the two SAN machines have a second 10Gbps connection directly between them.

Originally I had DRBD below LVM, but the folks at Linbit switched those two around to improve DRBD performance (multiple smaller DRBD devices perform better than one large one; this might have changed, since that advice was around 3 years ago). For anyone who wants the concrete picture, how such a stack goes together is sketched below.
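
All names below (devices, VG, DRBD resource, hostnames, addresses) are illustrative rather than the real ones, and the DRBD snippet is abbreviated, but it shows roughly how the layers are assembled:

   # RAID5 across the 8 SSDs, keeping the existing 64k chunk
   mdadm --create /dev/md0 --level=5 --raid-devices=8 --chunk=64 /dev/sd[b-i]

   # LVM on top of the array
   pvcreate /dev/md0
   vgcreate vg_san /dev/md0
   lvcreate -L 400G -n lv_vm1 vg_san

   # DRBD resource pairing one LV from each SAN (e.g. /etc/drbd.d/r0.res)
   resource r0 {
       net { protocol C; }          # synchronous replication
       device    /dev/drbd0;
       disk      /dev/vg_san/lv_vm1;
       meta-disk internal;
       on san1 { address 10.0.1.1:7789; }
       on san2 { address 10.0.1.2:7789; }
   }
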
>> [ ... ] The expectation, in terms of performance for VDI is
>> quite high. VMware like to say you can get away with 8-12
>> IOPS per virtual. Most people think you only get good
>> performance with 100 IOPS per virtual. [ ... ]

> Those 100 random IOPS per VM are a bit "random", but roughly
> translate to one "disk arm" per VM, which is not necessarily
> enough: http://www.sabi.co.uk/blog/15-one.html#150305
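
To put rough numbers on those two rules of thumb (the VM count is picked purely for illustration):

   8-12 IOPS x 50 VMs =   400-600 sustained random IOPS  (VMware's optimistic figure)
   100  IOPS x 50 VMs =     5,000 sustained random IOPS  (the "good performance" figure)

The second figure is easy for SSDs on reads; the catch, as below, is small sync writes.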

> [ ... ]
>
>>> The queue sizes and waiting time on the second server are
>>> very low (on a somewhat similar system using 4TB disks I see
>>> waiting times in the 1-5 seconds range, not milliseconds).
>>
>> The expectation, in terms of performance for VDI is quite high.
>> [ ... ]
>
> Sure, but the point here as to the speed issue is not that the
> SSDs are overwhelmed with IO, as the traffic on them is low and
> has relatively low latency; it is that very few IOPS are getting
> retired.

Previously, when I was doing lots of tests on the system, I found I could get great IO using larger block sizes: up to 2.5GB/s read and 1.5GB/s write. Eventually I found that using more real-world block sizes, e.g. 1k to 4k, I got abysmal transfer rates.
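
The tests I ran were along these lines; the fio commands below are just an illustrative way to reproduce the comparison (the LV name is made up, and the write test destroys whatever is on it):

   # large blocks, the case that looked great
   fio --name=big --filename=/dev/vg_san/lv_test --ioengine=libaio \
       --direct=1 --rw=read --bs=1M --iodepth=32 --runtime=30 --time_based

   # small blocks, closer to the real VDI workload, where it fell apart
   fio --name=small --filename=/dev/vg_san/lv_test --ioengine=libaio \
       --direct=1 --rw=randwrite --bs=4k --iodepth=32 --runtime=30 --time_based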

>>> Thus the most likely issue here is the 'fsync' problem: for
>>> "consumerish" SSDs barrier-writes are synchronous, because
>>> they don't have a battery/capacitor-backed cache, and rather
>>> slow for small writes, because of the large size of erase
>>> blocks, which can be mitigated with higher over-provisioning.
>>
>> On many consumer SSDs, barrier writes are only barriers, and
>> are not syncs at all. You are guaranteed serialization but not
>> actual storage.
>
> Probably in this case that is irrelevant, because the numbers
> coming out from both the OP's experience and the tests in the
> links I mentioned show that small sync writes seem synchronous
> indeed for the 520/530, resulting in small write rates of around
> 1-5 MB/s, which matches the reported stats.
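
For anyone wanting to check this on a single drive, a sync-write test in isolation looks something like the below (device name made up; this overwrites data). At 4k, 1-5 MB/s corresponds to roughly 250-1250 IOPS:

   # one synchronous 4k write at a time, no queueing
   fio --name=syncw --filename=/dev/sdX --direct=1 --sync=1 \
       --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --runtime=30 --time_based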

>> Then again, in a server setup, especially with redundant power
>> supplies, power loss to the SSDs is rare. You are more
>> protecting against system hangs and other inter-connectivity
>> issues.
>
> That is also likely irrelevant here. The firmware in the flash
> SSD does not know about the system setup, and the DRBD is
> probably configured to request synchronous writes on the
> secondary with protocol "C".

Yep, using protocol C at the moment.

> BTW I don't know whether the process(es) writing to the DRBD
> primary also request synchronous writes, but that's hopefully the
> case too, if the VD layer has been configured properly.
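
For the record, the protocol actually in effect can be confirmed with (resource name illustrative):

   drbdadm dump r0 | grep protocol    # should show "protocol C;"
   cat /proc/drbd                     # connection and replication state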

>> The real system solution is to have some quantity of non-volatile
>> DRAM that you can stage writes in (either a PCI-e card
>> like a FlashTec or one or more nvDIMMs).
>
> If this were the case then the VD layer and the DRBD layer could
> be told not to use sync writes, but the numbers reported seem to
> indicate that sync writes are happening.
>
>> This is how the "major vendors" deal with sync writes.
>
> At the system level; but at the device level the "major vendors"
> put a large capacitor in "enterprise" SSDs for two reasons, one
> of them being to allow the persistence of the RAM write buffer, to
> minimize write amplification and erase latency (the other is not
> relevant here).
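
As an aside, the drive's volatile write cache can be inspected, and disabled at a large throughput cost, from Linux (device name illustrative):

   hdparm -W /dev/sdX     # show whether the volatile write cache is enabled
   hdparm -W0 /dev/sdX    # disable it, so a completed write is really on flash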

> [ ... ]
>
>>> * Small writes are a very challenging workload for flash SSDs
>>>   without battery/capacitor-backed caches.
>>
>> Even with battery backup, small writes create garbage
>> collection, so while batteries may give you some short term
>> bursts,
>
> That problem is mitigated with bigger overprovisioning in
> "enterprise" class flash SSDs. It can also be done in those of
> the "consumerish" class by partitioning them appropriately, or
> with 'hdparm -N'; but that does not seem to be the case here,
> because the reported stats show a small number of IOPS with lowish
> queue sizes and not that huge latencies.
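
For anyone following along, the two over-provisioning approaches mentioned look roughly like this for a 480GB drive (device name and sector counts illustrative; 'hdparm -N p' is permanent, so treat it with care, and ideally TRIM the drive first so the reclaimed area starts clean):

   blkdiscard /dev/sdX                # TRIM the whole drive
   hdparm -N /dev/sdX                 # show current vs native max sectors
   hdparm -N p843750000 /dev/sdX      # clip a ~937.5M-sector drive to ~90%

   # or equivalently, just never partition the tail of the drive:
   parted /dev/sdX mklabel gpt
   parted /dev/sdX mkpart primary 1MiB 90%
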
Can you advise what numbers I should look for, or worry about, which would indicate that the problem is (or isn't) an erase-cycle delay problem?
>> longer term, you still have to do the writes.
>
> Unfortunately flash SSDs don't merely have to "do the writes",
> as things are quite different: as I mentioned above the issue is
> the large erase blocks (and the several milliseconds it takes to
> erase one).
>
> In the absence of power backing for the write cache, every sync
> write, for example a 4KiB one, is (usually) stored immediately to
> a flash chip, which means (usually) a lot of write amplification
> because of RMW on the 8MiB (or larger) erase block plus the large
> latency (often near 10 milliseconds) of the erase operation
> before erase block programming.
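
A back-of-envelope calculation with those figures shows how punishing this is:

   one 4KiB sync write -> RMW of an 8MiB erase block
   write amplification:  8MiB / 4KiB = 2048x (worst case)
   erase latency:        ~10ms per erase -> a ceiling of ~100 sync writes/s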

> That largely explains why in the tests I have mentioned small
> sync write IOPS for many "consumerish" flash SSDs top out at around
> 100, instead of the usual >10,000 for small non-sync writes.

> Some flash SSDs use an additional SLC buffer with smaller erase
> blocks and lower latency to reduce the problem of flushing sync
> writes directly to MLC etc., and that may explain why the 520s are
> better than the 530s (if the 520s have an SLC buffer, though IIRC
> Intel only started using an SLC buffer with the 540 series).
I can see the 545s series performs similarly to (or better than, I can't tell yet) the 520 series, but it is certainly better than the 530.

> Flash SSDs have only been popular for around 5 years, so it is
> understandable that some important aspects of their performance
> envelope (like what may happen on sync writes) are not well known
> yet.

Thank you, I appreciate all the responses.

So far, I've decided to make the following two changes:
1) Replace all 16 existing SSDs with the 1000GB 545s model. This will double the capacity and remove all of the 530 model drives. My concern was (and is) that actually using this doubled capacity with the same per-drive performance will in effect halve performance relative to capacity. I will be able to leave 40G per drive un-partitioned, or partitioned and left blank, whichever is better. I'm not sure whether 40G per drive is enough to help with the write/erase problem, but I guess it should be better than nothing. I think this change will produce a 40% improvement (potentially, given that the 520 drives sit at around 40% utilisation while the 530s sit at 100%).
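
Concretely, leaving the space unpartitioned would look like the below (device name illustrative; on a 1000GB drive this reserves about 4% on top of the factory over-provisioning):

   blkdiscard /dev/sdX                        # so the spare area starts clean
   parted /dev/sdX mklabel gpt
   parted /dev/sdX mkpart primary 1MiB 960GB  # leaves the last ~40GB untouched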

2) Upgrade to Linux kernel 4.6 from Debian backports.
I think this change will give approximately a 30% improvement, because it will reduce reads by 4 for each write; a read is quicker than a write, though, so I'm hoping for 30% overall. It sounds like the performance should land somewhere between the current setup and RAID10.
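
Assuming a jessie system, the upgrade itself would be something like:

   echo "deb http://httpredir.debian.org/debian jessie-backports main" \
       > /etc/apt/sources.list.d/backports.list
   apt-get update
   apt-get -t jessie-backports install linux-image-amd64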

With the above two improvements, I'm hoping it will be enough to solve the problem.

At this stage, if that is not enough to solve the problem, my fall-back option is to convert to RAID10, but that's something I'd prefer to avoid based on cost, storage capacity, and the fact that it is difficult to expand the existing system past 8 drives (and hence past the current capacity).

I'm not convinced that changing the chunk size will make any difference (positive or negative), so I will likely leave it as it is (64k chunk size).
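
For completeness, the current chunk size is easy to confirm, and md can reshape it in place later if I change my mind (array name illustrative):

   mdadm --detail /dev/md0 | grep -i chunk
   # if it ever needs changing (slow, and needs a backup file):
   # mdadm --grow /dev/md0 --chunk=128 --backup-file=/root/md0-reshape.bak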

Regards,
Adam
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


