[ ... ]

>> A very brave configuration, a shining example of the
>> "syntactic" mindset, according to which any arbitrary
>> combination of legitimate features must be fine :-).

> While you may say that this configuration is very "brave", it
> is actually quite common for VDI "appliance" deployments. [
> ... ]

There are a lot of very "brave" sysadmins out there, and often I
have to clean up after them :-). But then I am one of those
boring people who think that «VDI "appliance" deployments» are
usually a phenomenally bad idea, as they require a storage layer
that has to cover all possible IO workloads optimally, as indeed
in:

> [ ... ] The expectation, in terms of performance for VDI is
> quite high. vmWare like to say you can get away with 8-12
> IOPS per virtual. Most people think you only get good
> performance with 100 IOPS per virtual. [ ... ]

Those 100 random IOPS per VM are a bit "random", but roughly
translate to one "disk arm" per VM, which is not necessarily
enough:

  http://www.sabi.co.uk/blog/15-one.html#150305

[ ... ]

>> The queue sizes and waiting time on the second server are
>> very low (on a somewhat similar system using 4TB disks I see
>> waiting times in the 1-5 seconds range, not milliseconds).

> The expectation, in terms of performance for VDI is quite high.
> [ ... ]

Sure, but the point here as to the speed issue is not that the
SSDs are overwhelmed with IO, as the traffic on them is low and
has relatively low latency; it is that very few IOPS are getting
retired.

>> Thus the most likely issue here is the 'fsync' problem: for
>> "consumerish" SSDs barrier-writes are synchronous, because
>> they don't have a battery/capacitor-backed cache, and rather
>> slow for small writes, because of the large size of erase
>> blocks, which can be mitigated with higher over-provisioning.

> On many consumer SSDs, barrier writes are only barriers, and
> are not syncs at all. You are guaranteed serialization but not
> actual storage.
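Whether a given device really pays the full sync-write cost is
easy to measure directly. A rough sketch of the kind of
microbenchmark I mean (the file path and sizes below are
placeholders of my choosing, not anything from the OP's setup):

```python
# Rough small-sync-write microbenchmark: time COUNT 4KiB writes,
# each followed by fdatasync(), and report the resulting IOPS and
# MB/s. "testfile" is a placeholder path; to test a device, put it
# on a filesystem that sits on the SSD in question.
import os
import time

BLOCK = 4096          # 4KiB, a typical small sync write
COUNT = 1000          # number of timed writes
PATH  = "testfile"    # placeholder path on the device under test

buf = b"\0" * BLOCK
fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
t0 = time.time()
for _ in range(COUNT):
    os.write(fd, buf)
    os.fdatasync(fd)  # force each write to stable storage
elapsed = time.time() - t0
os.close(fd)
os.remove(PATH)

iops = COUNT / elapsed
print("%.0f sync-write IOPS, %.2f MB/s" % (iops, iops * BLOCK / 1e6))
```

A "consumerish" flash SSD that really flushes on every
'fdatasync' will report something in the region of 100 IOPS and
a few hundred KB/s to a few MB/s here; one that merely orders
the writes will report figures close to its non-sync rates.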
Probably in this case that is irrelevant, because the numbers
coming out of both the OP's experience and the tests in the
links I mentioned show that small sync writes seem synchronous
indeed for the 520/530, resulting in small write rates of around
1-5 MB/s, which matches the reported stats.

> Then again, in a server setup, especially with redundant power
> supplies, power loss to the SSDs is rare. You are more
> protecting against system hangs and other inter-connectivity
> issues.

That is also likely irrelevant here. The firmware in the flash
SSD does not know about the system setup, and the DRBD is
probably configured to request synchronous writes on the
secondary with protocol "C". BTW I don't know whether the
process(es) writing to the DRBD primary also request synchronous
writes, but that's hopefully the case too, if the VD layer has
been configured properly.

> The real system solution is to have some quantity of non
> volatile DRAM that you can stage writes (either a PCI-e card
> like a FlashTec or one or more nvDIMMs).

If that were the case then the VD layer and the DRBD layer could
be told not to use sync writes, but the numbers reported seem to
indicate that sync writes are happening.

> This is how the "major vendors" deal with sync writes.

At the system level; but at the device level the "major vendors"
put a large capacitor in "enterprise" SSDs for two reasons, one
of them being to allow the persistence of the RAM write buffer,
to minimize write amplification and erase latency (the other is
not relevant here). [ ... ]

>> * Small writes are a very challenging workload for flash SSDs
>>   without battery/capacitor-backed caches.

> Even with battery backup, small writes create garbage
> collection, so while batteries may give you some short term
> bursts,

That problem is mitigated with bigger over-provisioning in
"enterprise" class flash SSDs.
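As a back-of-the-envelope illustration of the arithmetic (the
capacities below are hypothetical round numbers, not the OP's
devices): the over-provisioning ratio is simply the spare
capacity, whether factory-reserved or left unaddressed by the
admin, over the usable capacity:

```python
# Illustrative over-provisioning arithmetic with made-up numbers:
# any raw capacity that is never exposed to writes acts as extra
# spare area for the FTL's garbage collection.
raw_gib       = 240.0   # advertised capacity of a hypothetical SSD
partitioned   = 200.0   # capacity actually exposed via partitions
factory_spare = 16.0    # assumed hidden factory spare area

usable   = partitioned
spare    = (raw_gib - partitioned) + factory_spare
op_ratio = spare / usable
print("over-provisioning: %.0f%%" % (op_ratio * 100))  # → over-provisioning: 28%
```

With a bigger spare pool the FTL more often has a pre-erased
block ready, so fewer small writes have to wait for a
multi-millisecond erase in the critical path.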
It can also be done in those of the "consumerish" class by
partitioning them appropriately, or with 'hdparm -N'; but that
does not seem to be the case here, because the reported stats
show a small number of IOPS with lowish queue sizes and not that
huge latencies.

> longer term, you still have to do the writes.

Unfortunately flash SSDs don't merely have to "do the writes",
as things are quite different: as I mentioned above the issue is
the large erase blocks (and the several milliseconds it takes to
erase one). In the absence of power backing for the write cache,
every sync write, for example a 4KiB one, is (usually) stored
immediately to a flash chip, which means (usually) a lot of
write amplification because of RMW on the 8MiB (or larger) erase
block, plus the large latency (often near 10 milliseconds) of
the erase operation before erase-block programming. That largely
explains why in the tests I have mentioned small sync write IOPS
for many "consumerish" flash SSDs top out at around 100, instead
of the usual > 10,000 for small non-sync writes.

Some flash SSDs use an additional SLC buffer with smaller erase
blocks and lower latency to reduce the problem of flushing sync
writes directly to MLC etc., and that may explain why the 520s
are better than the 530s (if the 520s have an SLC buffer, but
IIRC Intel started using an SLC buffer with the 540 series).

Flash SSDs have only been popular for around 5 years, so it is
understandable that some important aspects of their performance
envelope (like what may happen on sync writes) are not well
known yet.

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html