Re: raid writethrough mode (WT), ssds and your DB. (was Performances issues with SSD volume ?)

Hi Bruce

I'm *extremely* certain of what I say when I say WB+BBU=good and direct WT=bad.

WB on the controller caches writes in the battery-backed RAID controller cache, so that in the event of a power failure every acknowledged write still eventually reaches the disk.

WT on the controller bypasses the battery-backed cache and sends writes directly to the SSD. If the SSD doesn't have sufficient capacitor backing of its own, in-flight writes are gone on power loss.
See the manual page quote below.
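For reference, on LSI-based controllers such as these PERC cards, the cache policies discussed here can be inspected and changed from the OS with MegaCLI. A sketch only; adapter and logical-drive selectors (`-a0`, `-LAll`) are assumptions and vary per system:

```shell
# Show the current cache policy for all logical drives on adapter 0
MegaCli -LDGetProp -Cache -LAll -a0

# Force write-back caching (relies on the BBU for safety)
MegaCli -LDSetProp WB -LAll -a0

# Safer variant: write-back only while the BBU is healthy,
# falling back to write-through if the battery fails
MegaCli -LDSetProp NoCachedBadBBU -LAll -a0

# Disable the drives' own volatile write cache
# (the part that lost data in the WT tests below)
MegaCli -LDSetProp -DisDskCache -LAll -a0
```
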

- With the H710 in WT and the ssd cache enabled, the SSDs I tested demonstrably lost data that should already have been fsync'd. The capacitor was insufficient and the firmware lied about completing the fsync.

- With the H710 in WB and the ssd cache enabled, the SSDs didn't lose writes; I have yet to see a failed fsync in any of the many dozens of tests I ran on several machines and disks*.

- Without the H710 and with the ssd cache enabled (i.e. WT direct to the drive), I always lost writes that were meant to be fsync'd.

- Without the H710 and with the ssd cache disabled, I never lost writes.
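The test principle behind diskchecker.pl can be sketched in a few lines of Python (my illustration of the idea, not the actual tool): write sequence-numbered, checksummed records, fsync after each one, record which records were acknowledged as durable, and after the crash verify that every acknowledged record is intact on disk.

```python
import hashlib
import os
import struct
import tempfile

RECORD = struct.Struct("<I32s")  # sequence number + sha256 of the payload

def write_records(path, count):
    """Write checksummed records, fsync'ing after each one.
    Returns the sequence numbers acknowledged as durable."""
    acked = []
    with open(path, "wb") as f:
        for seq in range(count):
            payload = seq.to_bytes(4, "little")
            f.write(RECORD.pack(seq, hashlib.sha256(payload).digest()))
            f.flush()
            os.fsync(f.fileno())  # the durability promise under test
            acked.append(seq)     # count a record only after fsync returns
    return acked

def verify_records(path, acked):
    """After a (real or simulated) crash: every acknowledged record
    must be present and pass its checksum."""
    lost = []
    with open(path, "rb") as f:
        data = f.read()
    for seq in acked:
        off = seq * RECORD.size
        chunk = data[off:off + RECORD.size]
        if len(chunk) < RECORD.size:
            lost.append(seq)
            continue
        got_seq, digest = RECORD.unpack(chunk)
        payload = got_seq.to_bytes(4, "little")
        if got_seq != seq or hashlib.sha256(payload).digest() != digest:
            lost.append(seq)
    return lost

path = os.path.join(tempfile.mkdtemp(), "fsync_test.bin")
acked = write_records(path, 100)
lost = verify_records(path, acked)
print(f"acked={len(acked)} lost={len(lost)}")
```

In the real test the writer and verifier run on different machines, with the plug pulled between them; any sequence number that was acknowledged but fails verification is an fsync that lied.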


There are two possible reasons the writes always hit the drive successfully in every test with controller WB and ssd disk cache enabled. 

Either a) the SSDs perhaps did lose fsync'd data, but the controller didn't. The battery-backed 512MB-1024MB RAID controller cache ensured fsync'd writes completed by reissuing them after power-up, just as it would for a hard drive after power loss. I have not been able to find sufficiently detailed technical documentation for these cards to establish exactly what they do after power loss in terms of disk communication, replaying writes, and so on; I only have my measured results.

However, it's also possible that b) the capacitors in the Crucial drives are just *slightly* too small to save all data in flight: in extensive plug-pull testing there was always a very small number of fsync'd writes that didn't make it to disk. So it is entirely possible that the *fairly slow* write-back cache on the Dell controller, which substantially *reduces* the IOPS of the ssd, is consistently limiting the amount of data held in the cache on the ssd, such that everything in it can be saved by the caps on the drive. By effectively running at half speed with the RAID controller cache as a choke point, you are never in a situation where the last few writes don't quite make it to disk, because half the ssd cache is sitting empty.

Also, keep in mind that I am reporting results from many dozens of runs of diskchecker.pl. It is possible that writes are lost in a way diskchecker.pl does not detect, or that there is, say, a 1-in-1000 situation I haven't found yet: for example, really heavy sequential writes with interspersed fsyncs, rather than just heaps of fsyncs.
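That heavier pattern is easy to generate. A hypothetical sketch of the kind of workload I mean: large sequential chunks with only occasional fsyncs, so that far more unsynced data is sitting in volatile caches at any moment than in an fsync-per-write test (names and sizes here are illustrative, not from any real test harness):

```python
import os
import tempfile

CHUNK = b"\xab" * (1 << 20)   # 1 MiB of sequential data per write
FSYNC_EVERY = 16              # fsync only every 16 writes, not every write

def heavy_sequential(path, total_mib=64):
    """Heavy sequential writes with interspersed fsyncs: between syncs,
    up to FSYNC_EVERY MiB accumulates in volatile caches."""
    synced_upto = 0  # bytes known durable when the last fsync returned
    with open(path, "wb") as f:
        for i in range(total_mib):
            f.write(CHUNK)
            if (i + 1) % FSYNC_EVERY == 0:
                f.flush()
                os.fsync(f.fileno())
                synced_upto = f.tell()
    return synced_upto

path = os.path.join(tempfile.mkdtemp(), "seq_test.bin")
durable = heavy_sequential(path)
print(f"durable bytes at last fsync: {durable}")
```

After a plug pull, only the bytes up to the last completed fsync are guaranteed; a drive whose caps are marginal might lose some of even those, which is exactly what a verifier should look for.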


For the avoidance of doubt:

In several afternoons of testing, I have *never* managed to lose fsync'd data from Crucial M500/M550 disks combined with a battery-backed RAID controller in write-back mode (WB).
In several afternoons of testing, I have *always* lost a small amount of fsync'd data from Crucial M500/M550 disks combined with a battery-backed RAID controller in write-through mode (WT).


I and others bought the M500/M550 on the back of the advertised capacitor-backed cache, but I won't trust manufacturer capacitor claims in future. Drive power-failure recovery really is something that needs testing by many customers to ascertain the truth of the matter. (Long story short: I recommend people buy an Intel model now, since Intel has proven most trustworthy in terms of manufacturer claims.)


Graeme Bell


https://www.dell.com/downloads/global/products/pvaul/en/perc-technical-guidebook.pdf

"DELL PERC H700 and H800 Technical Guide 20
4.10 Virtual Disk Write Cache Policies
The write cache policy of a virtual disk determines how the controller handles writes to that virtual disk. Write-Back and Write-Through are the two write cache policies and can be set on virtual disks individually.
All RAID volumes will be presented as Write-Through (WT) to the operating system (Windows and Linux) independent of the actual write cache policy of the virtual disk. The PERC cards manage the data in cache independently of the operating system or any applications. You can use OpenManage or the BIOS configuration utility to view and manage virtual disk cache settings.
In Write-Through caching, the controller sends a data-transfer completion signal to the host system when the disk subsystem has received all the data in a transaction. In Write-Back caching, the controller sends a data transfer completion signal to the host when the controller cache has received all the data in a transaction. The controller then writes the cached data to the storage device in the background.
The risk of using Write-Back cache is that the cached data can be lost if there is a power failure before it is written to the storage device. This risk is mitigated by using a BBU on PERC H700 or H800 cards. Write-Back caching has a performance advantage over Write-Through caching. The default cache setting for virtual disks is Write-Back caching. Certain data patterns and configurations perform better with a Write-Through cache policy.
Write-Back caching is used under all conditions in which the battery is present and in good condition."






On 28 May 2015, at 01:20, Bruce Momjian <bruce@xxxxxxxxxx> wrote:

> On Thu, May 21, 2015 at 11:21:49AM +0000, Graeme B. Bell wrote:
>>> Not using your raid controllers write cache then?  Not sure just how
>>> important that is with SSDs these days, but if you've got a BBU set
>>> it to "WriteBack". Also change "Cache if Bad BBU" to "No Write Cache
>>> if Bad BBU" if you do that.
>> 
>> I did quite a few tests with WB and WT last year.
>> 
>> - WT should be OK with e.g. Intel SSDs.  From memory I saw write
>> performance gains of about 20-30% with Crucial M500/M550 writes on a
>> Dell H710 RAID controller. BUT that controller didn't have WT fastpath
>> though which is absolutely essential to see substantial gains WT. I
>> expect with WT and a fastpath enabled RAID you'd see much higher
>> numbers, e.g. 100%+ higher IOPS.
>> 
>> (So, if you don't have fastpath on your controller, you might as
>> well plan to leave WB on and just buy cheaper SSD drives rather than
>> expensive ones - the raid controller will be your choke point for
>> performance on WT and it's a source of risk).
>> 
>> - WT with most SSDs will likely corrupt your postgres database the
>> first time you lose power. (on all the drives I've tested)
>> 
>> - WB is the only safe option unless you have done lots of plug pull
>> tests on a drive that is guaranteed to protect data "in flight" during
>> power loss (Intel disks + maybe the new samsung pcie).
> 
> I think you have WT (write-through) and WB (write-back) reversed above.
> 
> -- 
>  Bruce Momjian  <bruce@xxxxxxxxxx>        http://momjian.us
>  EnterpriseDB                             http://enterprisedb.com
> 
>  + Everyone has their own god. +



-- 
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin




