raid writethrough mode (WT), ssds and your DB. (was Performances issues with SSD volume ?)

"Graeme B. Bell" <grb@xxxxxxxxxxxxxxxxx> · Thu, 21 May 2015 11:21:49 +0000

> Not using your raid controllers write cache then?  Not sure just how important that is with SSDs these days, but if you've got a BBU set it to "WriteBack". Also change "Cache if Bad BBU" to "No Write Cache if Bad BBU" if you do that.

I did quite a few tests with WB and WT last year. 

- WT should be OK with e.g. Intel SSDs. From memory I saw write performance gains of about 20-30% with Crucial M500/M550 writes on a Dell H710 RAID controller. BUT that controller didn't have WT fastpath, which is absolutely essential to see substantial gains from WT. I expect with WT and a fastpath-enabled RAID controller you'd see much higher numbers, e.g. 100%+ higher IOPS.

(So, if you don't have fastpath on your controller, you might as well plan to leave WB on and just buy cheaper SSD drives rather than expensive ones - the raid controller will be your choke point for performance on WT and it's a source of risk). 

- WT with most SSDs will likely corrupt your postgres database the first time you lose power (on all the drives I've tested).

- WB is the only safe option unless you have done lots of plug pull tests on a drive that is guaranteed to protect data "in flight" during power loss (Intel disks +  maybe the new samsung pcie). 

A relevant anecdatum...

A certain company makes SSD drives; for the sake of discussion let's call two of their models the XY00 and the XY50. These were popular SSD drives that were advertised everywhere as having power loss protection throughout 2013-2014. We bought lots of them here because of that 'power loss protection' aspect + a good price + performance + good reliability record + the good name of the company.

When I tested with the famous 'diskchecker.pl' tool (* link at end), I found that they don't actually provide full power loss protection. Some data in flight (even fsyncs!) was lost. 
I tested using several computers, several copies of each disk model, with "XY00" and "XY50" models, and with and without RAID controllers. 
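
For anyone who hasn't used this kind of tool, here is a rough Python sketch of the idea behind an fsync test. It is not diskchecker.pl itself (which reports each acknowledged write to a second machine over the network); the record layout and the names write_records and verify_records are made up for illustration. The writer appends numbered, checksummed records and fsyncs after each one; after the power cut, the verifier lists every record that was acknowledged but is missing or damaged.

```python
import os
import struct
import zlib

# Hypothetical record layout: 8-byte sequence number + CRC32 of that number.
RECORD = struct.Struct("<QI")


def write_records(path, count):
    """Append `count` records, fsyncing after each; return the last acknowledged sequence."""
    last_acked = -1
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for seq in range(count):
            payload = struct.pack("<Q", seq)
            os.write(fd, RECORD.pack(seq, zlib.crc32(payload)))
            os.fsync(fd)  # once this returns, the storage stack claims the record is durable
            last_acked = seq  # a real test must record this on ANOTHER machine
    finally:
        os.close(fd)
    return last_acked


def verify_records(path, last_acked):
    """Return the acknowledged sequence numbers that are missing or fail their checksum."""
    with open(path, "rb") as f:
        data = f.read()
    found = set()
    for off in range(0, len(data) - RECORD.size + 1, RECORD.size):
        seq, crc = RECORD.unpack_from(data, off)
        if zlib.crc32(struct.pack("<Q", seq)) == crc:
            found.add(seq)
    return [s for s in range(last_acked + 1) if s not in found]
```

Any sequence number returned by verify_records after a plug pull is a write that was confirmed as fsynced and then lost. The last acknowledged number has to be kept off the machine under test, since that machine loses its memory when the plug is pulled.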

The only way I could keep the data safe for fsyncs and DB use with these drives during power failure was to either a) use a RAID controller with WB or b) disable the SSD's own write cache, which is horrifyingly bad for performance.
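
As a side note on option b), a small Python sketch for checking what the kernel currently reports about a drive's write cache. It assumes a reasonably recent Linux kernel that exposes /sys/block/<device>/queue/write_cache with the values "write back" or "write through"; the function name is made up for illustration. Behind a RAID controller the device you see is usually the logical volume, so this tells you nothing about the physical drives, and it is no substitute for a plug pull test.

```python
from pathlib import Path


def volatile_write_cache_enabled(device, sysfs_root="/sys/block"):
    """Return True if the kernel reports a write-back (volatile) cache for `device`, e.g. "sda".

    Assumes the sysfs attribute queue/write_cache exists for the device;
    sysfs_root is a parameter only so the function can be tried on a fake tree.
    """
    text = (Path(sysfs_root) / device / "queue" / "write_cache").read_text().strip()
    if text == "write back":
        return True
    if text == "write through":
        return False
    raise ValueError(f"unexpected write_cache value: {text!r}")
```

Usage would be something like volatile_write_cache_enabled("sda"). This only reads the kernel's view of the device; it does not change the drive's setting.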

So I wrote to the company's engineering in early August 2014 about this (because we had spent quite a lot of money on these disks) and corresponded with a QA engineer to show them my results and show them how to reproduce the data loss problem, asking if maybe they could produce a firmware patch or some other fix.

At first they were extremely interested to know more. Then once they had the information to fully reproduce the bug, they went silent and wouldn't reply to any emails.

About 1-2 months later, articles started appearing on enthusiast tech sites. Not new firmware, just company product reps explaining that "power loss protection" doesn't really mean all your data is protected from power loss, and that it's unreasonable to expect the drive to do what it says on the box. 

:-(

Lessons to take away:

--- WT + many SSDs + power loss = likely DB corruption. 

--- No raid card + many SSDs + power loss = likely DB corruption. 

--- WB + many SSDs + power loss = should be fine but you must test it a few times. 

--- Never use WT mode on any production system until you've run a ton of tests on the drive's ability to honor fsyncs.

--- Never trust any vendor to provide correctly working equipment regardless of how often they make promises in advertising. Buy the smallest amount possible and test it first yourself in the most realistic environment possible. That goes for RAID controllers advertised as having fastpath, which actually didn't, and SSDs heavily advertised as having power loss protection, which actually didn't protect all data from power loss.

--- oh and NEVER do a power loss test by holding the power button. On every machine I've tested with SSDs, a power button shutdown (e.g. hold power for 5 seconds till it turns off) did not create lost data, whereas a plug pull test (yank the power out of the power supply) always produced lost data. The plug pull test reproduces a real life power failure more accurately. The power button test will only give you an illusion of safety. 

Graeme Bell

p.s. https://gist.github.com/bradfitz/3172656      - diskchecker.pl

-- 
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin