Re: raid writethrough mode (WT), ssds and your DB. (was Performances issues with SSD volume ?)

Hi Bruce,

> It is my understanding that write-through is always safe as it writes
> through to the layer below and waits for acknowledgement.  Write-back
> doesn't, so when you say:

I said WT twice, and I assure you for a third time: in the tests I carried out on Crucial SSD disks, **WT** was not safe or reliable, whereas BATTERY-BACKED WB was.

I believe this is confusing/surprising because your assumptions about WT and WB, and about the persistence of fsyncs on SSD disks, don't match what actually happens in practice.
Specifically: SSDs lie to the SATA/RAID controller about what they're really doing, and unfortunately WT mode trusts the SSD to be honest.

Hopefully this will help explain what's happening:

1. Historically, RAID write-back caches were unreliable for fsyncs because, with no battery to persist the data (and no non-volatile WB cache, as we see on modern RAID controllers), anything in the WB cache was lost during a power failure. So your safe options were: a WB cache with a battery (safe since the cache 'never loses power'), or WT to disk (safe only if the disk can be trusted to persist writes through a power loss).

2. If you use WT at the RAID controller level, then 'in principle' an fsync call should not return until the data is safely on the drive. For postgres, the fsync call is the most important thing: regular data writes can fail and crash recovery is still possible via WAL replay, but if fsyncs don't work properly, crash recovery is probably not possible. (There's a small sketch of what the 'fsync contract' means after point 9.)

3. In reality, on most SSDs, if you use WT on the RAID controller to go direct to disk and make an fsync call, the drive's controller tells you the data is persisted while, behind the scenes, the data is actually held in a volatile write-back cache and is vulnerable to catastrophic loss during a power failure.

4. So basically: the SSD drive's controller lies about fsync and does not honor the 'fsync contract'. 

5. It's not just SSDs. If you are using e.g. a NAS system, perhaps storing your DB over NFS, then the NAS is almost certainly lying to you when you make fsync calls.

6. It's not just SSDs and NAS systems. If you use a virtual server / VMware then, to improve performance, the virtualised disk may be lying about fsync persistence.

7. Why do SSDs, NAS systems and VMs lie about fsync persistence? Because it improves apparent performance, benchmark numbers and so on, at the expense of corruption that happens only rarely, during power failures. In some applications this is an acceptable trade-off, e.g. network file shares for powerpoints and word documents, mail servers... but for postgres it is not.

8. The only way to get a real idea about WT and WB and what is happening under the hood is to fire up diskchecker.pl and measure what actually happens when you pull the plug (many times). Then you'll know for sure how your setup behaves in terms of performance and write persistence. Don't trust anything you read online written by other people, especially by drive manufacturers - test for yourself if your database matters to you. (Example commands are after point 9.) Remember to use a 'real' plug pull - yank the cord out of the back - don't simply power off with the button, or you'll get incorrect results. That said, the Intel drives have a great reputation, and every report I've read indicates they work correctly with WT or direct connections.

9. That said, with non-Intel SSD drives you may be able to use WT if you turn off all caching on the SSD, which may stop the SSD lying about what it's doing (see the hdparm example below). On the disks I've tested, this allows fsync writes to hit the disk correctly in WT mode.

However! The costs are enormous: a substantial increase in SSD wear (i.e. much earlier drive failure) and a massive decrease in performance (a ~96% drop, i.e. you get about 4% of the performance you expect from the disk - it can actually be slower than an HDD!).

Example (from memory, but I'm very certain about these numbers): I measured ~100 disk operations per second with diskchecker.pl and Crucial M500s when all disk cache was disabled and WT was used. This compared with ~2000-2500 diskchecker operations per second with a *battery-backed* WB cache and disk cache enabled. In both cases, data persisted correctly and was not corrupted during repeated plug-pull testing.
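
To make point 2 a bit more concrete, here is a minimal sketch of what the 'fsync contract' means for a database writing a commit record. (This is only an illustration I'm adding here, not code from my tests; the filename is made up and error handling is trimmed.)

    /* Minimal sketch of the fsync contract (illustrative only). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char buf[] = "commit record\n";
        int fd = open("wal_segment", O_WRONLY | O_CREAT | O_APPEND, 0600);
        if (fd < 0) { perror("open"); return 1; }

        if (write(fd, buf, strlen(buf)) != (ssize_t) strlen(buf)) {
            perror("write"); return 1;
        }

        /* The contract: fsync() must not return success until the data
         * is on stable storage.  If the drive acks the flush but only
         * holds the data in a volatile cache, a power cut right after
         * this call loses a 'committed' write - exactly the failure
         * diskchecker.pl catches. */
        if (fsync(fd) != 0) { perror("fsync"); return 1; }

        close(fd);
        return 0;
    }

Postgres relies on exactly this guarantee for its WAL writes at commit time, which is why a lying fsync path means crash recovery probably won't work.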
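
For point 8: the test itself is quick to set up. Roughly - this is from memory, so check the usage text in the script itself - diskchecker.pl runs like this, with a second always-on machine acting as the listener ('otherhost' and the file path are placeholders):

    # on a second machine that keeps its power:
    ./diskchecker.pl -l

    # on the machine under test, writing to the volume you care about:
    ./diskchecker.pl -s otherhost create /path/on/test/volume/testfile 500

    # ...pull the plug mid-run, power back up, remount, then:
    ./diskchecker.pl -s otherhost verify /path/on/test/volume/testfile

Repeat the plug-pull many times; one clean run proves nothing.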
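
For point 9: on drives the OS can see directly, the drive's volatile write cache can usually be toggled with hdparm (the device name below is just an example; if the drives sit behind a RAID controller you'll need the controller's own CLI to change its disk-cache policy instead):

    # show the current write-cache setting
    hdparm -W /dev/sdX

    # turn the drive's write cache off (the 'all caching off' case above)
    hdparm -W 0 /dev/sdX

Expect the kind of performance drop described above once you do this.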

 

For what it's worth: a similar gap between theory and practice comes up with RAID controllers and the probabilities of RAID failure.

You can read any statistic you like about the likelihood of e.g. RAID6 failing due to consecutive disk failures (e.g. this ACM paper, which takes UBE errors into account, suggests a 0.00639% risk of loss per year for RAID5/6: http://queue.acm.org/detail.cfm?id=1670144), but in practice what kills your database is more likely the RAID controller card failing, having a bug, or having an unreliable battery (e.g. http://www.webhostingtalk.com/showthread.php?t=1075803 - 25% failure rates on some models!).

Graeme Bell







