Re: OT (slightly) testing for data loss on an SSD drive due to power failure

Greg Smith <greg@xxxxxxxxxxxxxxx> · Fri, 22 Apr 2011 22:48:04 -0400

On 04/22/2011 10:04 AM, John Rouillard wrote:
> We have a couple of SSDs (2 x 160GB Intel X25-M MLC SATA)
> acting as the zil (write journal) and are trying to see if they are
> safe to use in a power fail situation.

Well, the quick answer is "no".  I've lost several weekends of my life 
to recovering information from databases stored on those drives, after 
they were corrupted in a crash.

> The testing method is to copy a bunch of files over NFS to the server
> with the zil. When the copy is running along, pull the power to the
> server. The NFS client will stop and if the client got a message that
> block X was written safely to the zil, it will continue writing with
> block X+1. After the server comes back up and the copies
> resume/finish the files are checksummed. If block X went missing, the
> checksums will fail and we will have our proof.

Interestingly, you have reinvented parts of the standard script for 
testing for data loss, diskchecker.pl:  
http://brad.livejournal.com/2116715.html

You can get a few thousand commits per second using that program, which 
is enough to fill the drive buffer such that a power pull should 
sometimes lose something.  I don't think you can do a proper test here 
using NFS; you really need something that is executing fsync calls 
directly in the same pattern a database server will.
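
To make "executing fsync calls directly" concrete, here is a minimal 
sketch of that write pattern. It is not diskchecker.pl, only an 
illustration; the record format (one 8-byte sequence number per commit) 
and the names commit_records and on_ack are made up for this example. 
In a real test the acknowledgment would need to go to a second machine 
so that it survives the power pull.

```python
import os
import struct
import tempfile

# Hypothetical record format: one 8-byte little-endian sequence number per commit.
RECORD = struct.Struct("<Q")


def commit_records(path, count, on_ack=None):
    """Append `count` sequence numbers, calling fsync after each one.

    `on_ack(seq)` runs only after os.fsync() has returned, i.e. after the
    OS claims the record has reached stable storage.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for seq in range(count):
            os.write(fd, RECORD.pack(seq))
            os.fsync(fd)  # the call a database issues at commit time
            if on_ack is not None:
                on_ack(seq)  # a real test would report this to another host
    finally:
        os.close(fd)


if __name__ == "__main__":
    acked = []
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "commits.dat")
        commit_records(target, 1000, acked.append)
        print("acknowledged", len(acked), "commits;",
              os.path.getsize(target), "bytes on disk")
```

The point of the design is the ordering: nothing is acknowledged until 
fsync has returned, so any acknowledged record that is missing after 
the power pull was lost below the operating system.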

ZFS is more resilient than most filesystems at avoiding file 
corruption in this case.  But you should still be able to find some 
missing transactions that were sitting in the drive cache.
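
Finding those missing transactions comes down to comparing what was 
acknowledged before the power pull against what is actually on disk 
afterward. A rough sketch, assuming a hypothetical test file made of 
8-byte little-endian sequence numbers and a list of acknowledged 
numbers that was recorded on another machine (both are assumptions for 
illustration, not something any particular tool produces):

```python
import os
import struct
import tempfile

# Hypothetical record format: one 8-byte little-endian sequence number per commit.
RECORD = struct.Struct("<Q")


def read_sequences(path):
    """Return the set of sequence numbers present in the file."""
    found = set()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD.size)
            if len(chunk) < RECORD.size:  # end of file, or a torn final record
                break
            found.add(RECORD.unpack(chunk)[0])
    return found


def missing_commits(acked, path):
    """Sorted sequence numbers that were acknowledged but are not on disk."""
    return sorted(set(acked) - read_sequences(path))


def demo():
    """Simulate a drive that acknowledged 7 commits but only kept 5."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "commits.dat")
        with open(path, "wb") as f:
            for seq in range(5):
                f.write(RECORD.pack(seq))
        return missing_commits(range(7), path)


if __name__ == "__main__":
    print(demo())  # prints [5, 6]
```

Any non-empty result means the drive reported a write as durable and 
then lost it.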

--
Greg Smith   2ndQuadrant US    greg@xxxxxxxxxxxxxxx   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
