On 10/24/12 4:04 PM, Chris Angelico wrote:
Is this a useful and plausible testing methodology? It's definitely shown up some failures. On a hard disk, all is well as long as the write-back cache is disabled; on the SSDs, I can't make them reliable.
On Linux systems, you can tell when Postgres is busy writing data out during a checkpoint because the "Dirty:" amount in /proc/meminfo will be dropping rapidly. At most other times, that number goes up. You can increase the odds of finding database-level corruption during a pull-the-plug test by yanking the power during that most sensitive moment. Combine a reasonably write-heavy test like you've devised with that "optimization", and systems that don't write reliably will usually corrupt within a few tries.
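A minimal way to watch that number is to poll /proc/meminfo. The little Python sketch below (mine, not part of the original test setup) prints the Dirty: counter once a second and notes when it's falling, which is your cue that writeback is in progress and the plug pull is most likely to expose a lying drive:

#!/usr/bin/env python3
"""Poll /proc/meminfo and print the kernel's Dirty: counter once a second,
so you can see when a checkpoint is flushing heavily and time the plug pull."""
import time

def dirty_kb():
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Dirty:"):
                # Line looks like: "Dirty:      123456 kB"
                return int(line.split()[1])
    return 0

prev = dirty_kb()
while True:
    time.sleep(1)
    cur = dirty_kb()
    trend = "falling (writeback in progress)" if cur < prev else "rising/steady"
    print(f"Dirty: {cur} kB  [{trend}]")
    prev = cur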
In general, though, diskchecker.pl is the more sensitive test. If it fails, storage is unreliable for PostgreSQL, period. It's good that you've followed up by confirming that the real database corruption it implies is also visible, but that normally isn't needed. Diskchecker says the drive is bad, you're done--don't put a database on it. The database-level tests are more for catching false positives from diskchecker: cases where it says the drive is OK, but a filesystem problem it doesn't test for still makes the setup unreliable.
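For anyone who hasn't run it: diskchecker.pl works by writing and fsyncing blocks on the machine under test while reporting, to a second machine, exactly which blocks the drive has acknowledged as durable; after the power pull, any acknowledged block that's missing or mangled means the storage lied about fsync. The Python sketch below only illustrates that protocol's writer side -- it is not the real tool, and the remote host address and file name are made up for the example:

#!/usr/bin/env python3
"""Conceptual sketch of the diskchecker.pl idea (not the real tool): write
numbered blocks, fsync each one, and only after fsync returns report the
block's number and checksum to another machine over TCP."""
import os, socket, struct, sys, zlib

REMOTE = ("192.0.2.10", 5400)   # hypothetical logging host on another machine
PATH   = "testfile"             # hypothetical test file on the drive under test
BLOCK  = 8192

def run_writer(n_blocks):
    sock = socket.create_connection(REMOTE)
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o644)
    for i in range(n_blocks):
        payload = struct.pack("I", i) * (BLOCK // 4)
        os.pwrite(fd, payload, i * BLOCK)
        os.fsync(fd)                       # drive claims the block is durable now
        crc = zlib.crc32(payload)
        # Only after fsync returns do we tell the remote host about block i.
        sock.sendall(f"{i} {crc}\n".encode())
    os.close(fd)

if __name__ == "__main__":
    run_writer(int(sys.argv[1]) if len(sys.argv) > 1 else 100000)

After the power cycle, the verify phase compares what the remote host logged against what actually survived in the file; the real diskchecker.pl implements both halves.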
What SSD are you using? The Intel 320 and 710 series are the only SATA-connected drives still on the market that I know of which pass a serious test. The other good models are direct PCI-E storage units, like the FusionIO drives.
--
Greg Smith   2ndQuadrant US   greg@xxxxxxxxxxxxxxx   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support   www.2ndQuadrant.com