[ ... ]

> However, I can't get the budget for those really awesome
> drives up the top of the list, that would require around
> $20k... or more.

> For now, I've got 16 x 545s TB drives, and have replaced the
> first half (ie, all drives in one server). Now I can see that
> the drives themselves don't seem to be the bottleneck (the
> drives don't run at 100% util, while the DRBD device does run
> at 100%).

The "%util" number is not that easy to interpret, especially for
flash SSD, and in some situations which probably include this
one:

  https://brooker.co.za/blog/2014/07/04/iostat-pct.html

> Hopefully that will line up right !

It is hard to read, and I don't understand what the numbers are,
but it does not matter a lot.

> So I can only presume the new drives are much better than the
> 530 series, but still not as good as the 520 series.

The 540s have an SLC write buffer, as I mentioned previously,
which should help.

> However, the point of note is that DRBD devices are showing
> high util levels much more frequently than the underlying
> devices, so I can only assume that the current limitation is
> caused by DRBD rather than the drives.

I like guessing, but this assumption seems to me a bit
excessive.

> From my understanding, the times these settings can cause a
> problem: [ ... ]

If you don't have reliable sync barriers at all levels (not just
DRBD), *any* crash (e.g. bug crash, mistake-crash, memory-full
crash, not just power crash) is going to cause massive trouble,
especially in a mostly-write workload where what is being
written is cache spill. Some interesting pages:

  http://blog.2ndquadrant.com/intel_ssd_now_off_the_sherr_sh/
  http://wiki.postgresql.org/wiki/Reliable_Writes
  http://archive.is/WTeAE
  https://news.ycombinator.com/item?id=6973179
  http://lkcl.net/reports/ssd_analysis.html

>>> Do you have any other suggestions or ideas that might
>>> assist?

Another one that would likely give a bit of relief, since you
can't budget for write-optimized "enterprise" flash SSDs, is a
SATA/SAS host adapter with a very large battery-backed RAM
buffer. As the tests that I previously mentioned show, longer
writes result in much improved write rates on "consumer" flash
SSD devices, and hopefully the large buffer results in:

#1 When the large write buffer flushes, *hopefully* much longer
   writes to the flash SSD will happen on average.

#2 Thanks to the battery backing, writes are reported completed
   to the OS when they reach the host adapter's buffer, rather
   than the flash SSD layer.

If #1 does not happen, #2 won't help much when writes are at the
flash SSD saturation level; it only helps if they are bursty and
on average below it. (A quick way to check the write-size effect
on a given drive is sketched at the end of this message.)

>> * Smaller RAID5 stripes, as in 4+1 or 2+1, are cheaper in
>> space than RAID10 and enormously raise the chances that a
>> full stripe-write can happen (it still has the write-hole
>> problem of parity RAID).

> I was planning to upgrade to the 4.4.x kernel, which would
> kind of solve this, [ ... ]

The write-hole workaround in MD RAID relies on a mostly-write
journal device, much like DRBD does.

>> * Make sure the DRBD journal is also on a separate device
>> that allows fast small sync writes.

> I think this would be the next option to investigate.
> Currently the DRBD journal is on the same devices.

That means that every sync'ed write becomes two writes to the
same device.

> 2 x P3700 400GB is probably around $2500,

The Samsung SM863 I have already mentioned are write-optimized
too, and much cheaper, at around $300-350 for the 480GB model.
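Whether a candidate journal device actually delivers fast small
sync writes is easy to measure rather than guess. Here is a
minimal sketch in plain C (the /mnt/testdev path is hypothetical,
point it at a scratch file on the device under test); ioping or
fio do the same job more thoroughly:

  /* Time small fdatasync'ed writes, roughly the access pattern
   * a journal device sees.  Build: cc -O2 -o syncbench syncbench.c */
  #include <stdio.h>
  #include <string.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <time.h>

  #define WRITES 1000
  #define BLKSZ  4096              /* typical journal-sized write */

  int main(void)
  {
      /* hypothetical path: a scratch file on the device under test */
      int fd = open("/mnt/testdev/syncbench.dat",
                    O_WRONLY | O_CREAT | O_TRUNC, 0600);
      if (fd < 0) { perror("open"); return 1; }

      char buf[BLKSZ];
      memset(buf, 0xAB, sizeof buf);

      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (int i = 0; i < WRITES; i++) {
          if (write(fd, buf, sizeof buf) != sizeof buf) {
              perror("write"); return 1;
          }
          /* force it through the whole stack, like a journal commit */
          if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }
      }
      clock_gettime(CLOCK_MONOTONIC, &t1);

      double secs = (t1.tv_sec - t0.tv_sec)
                  + (t1.tv_nsec - t0.tv_nsec) / 1e9;
      printf("%d synced %dB writes in %.2fs -> %.0f us avg, %.0f IOPS\n",
             WRITES, BLKSZ, secs, secs / WRITES * 1e6, WRITES / secs);
      close(fd);
      return 0;
  }

A drive without power-loss protection tends to look much worse
here than its datasheet sequential numbers suggest, which is
exactly the point of the SM863/P3700 class.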
> while 12 x 545s 1000GB is around $4800, [ ... ]

Many people try to use "consumer" drives to build manager-wowing
systems with huge capacity and low cost, but vendors are not
stupid, and make sure that premium-priced "enterprise" drives
have some critical advantage for at least some important
workloads (usually write-heavy ones, guessing that "enterprise"
workloads that can command premium prices are transactional).
Sometimes, as with SSDs, the advantages are based on real stuff,
capacitors and overprovisioning, which do cost money; sometimes
they are artificial, like disabling SCT/ERC control.
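As to the write-size effect mentioned above (#1), it can be
checked on a specific drive with another minimal C sketch, again
with a hypothetical test path, using O_DIRECT to bypass the page
cache (fio with varying "bs" values does this more thoroughly):

  /* Sequential write throughput as a function of write size.
   * Build: cc -O2 -o writesz writesz.c */
  #define _GNU_SOURCE              /* for O_DIRECT */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <time.h>

  #define TOTAL (256UL * 1024 * 1024)   /* bytes written per size */

  static double bench(size_t blksz)
  {
      /* hypothetical path: a scratch file on the device under test */
      int fd = open("/mnt/testdev/writesz.dat",
                    O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0600);
      if (fd < 0) { perror("open"); exit(1); }

      void *buf;                   /* O_DIRECT needs aligned buffers */
      if (posix_memalign(&buf, 4096, blksz)) { perror("memalign"); exit(1); }
      memset(buf, 0xAB, blksz);

      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (size_t done = 0; done < TOTAL; done += blksz)
          if (write(fd, buf, blksz) != (ssize_t)blksz) {
              perror("write"); exit(1);
          }
      fdatasync(fd);
      clock_gettime(CLOCK_MONOTONIC, &t1);

      free(buf);
      close(fd);
      double secs = (t1.tv_sec - t0.tv_sec)
                  + (t1.tv_nsec - t0.tv_nsec) / 1e9;
      return TOTAL / secs / (1024 * 1024);      /* MB/s */
  }

  int main(void)
  {
      for (size_t sz = 4096; sz <= 4096 * 256; sz *= 4)
          printf("%7zu B writes: %.0f MB/s\n", sz, bench(sz));
      return 0;
  }

If the MB/s figure keeps climbing with the write size, a large
battery-backed buffer that coalesces small writes into big
flushes should help; if it is flat, it mostly won't.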