On 10/24/12 4:04 PM, Chris Angelico wrote:
Is this a useful and plausible testing methodology? It's definitely shown up some failures. On a hard disk, all is well as long as the write-back cache is disabled; on the SSDs, I can't make them reliable.
On Linux systems, you can tell when Postgres is busy writing data out during a checkpoint because the "Dirty:" amount in /proc/meminfo will be dropping rapidly. At most other times, that number goes up. You can increase the odds of finding database-level corruption during a pull-the-plug test by yanking the power during that most sensitive moment. Combine a reasonably write-heavy test like you've devised with that "optimization", and systems that don't write reliably will usually corrupt within a few tries.
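A minimal way to watch that number is to poll /proc/meminfo. The little Python sketch below (mine, not part of the original test setup) prints the Dirty: counter once a second and notes when it's falling, which is your cue that writeback is in progress and the plug pull is most likely to expose a lying drive:

#!/usr/bin/env python3
"""Poll /proc/meminfo and print the kernel's Dirty: counter once a second,
so you can see when a checkpoint is flushing heavily and time the plug pull."""
import time

def dirty_kb():
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Dirty:"):
                # Line looks like: "Dirty:      123456 kB"
                return int(line.split()[1])
    return 0

prev = dirty_kb()
while True:
    time.sleep(1)
    cur = dirty_kb()
    trend = "falling (writeback in progress)" if cur < prev else "rising/steady"
    print(f"Dirty: {cur} kB  [{trend}]")
    prev = cur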
In general, though, diskchecker.pl is the more sensitive test. If it fails, storage is unreliable for PostgreSQL, period. It's good that you've followed up by confirming that the real database corruption it implies is also visible, but that normally isn't needed. Diskchecker says the drive is bad, you're done--don't put a database on it. The database-level tests are more for catching false positives from diskchecker: cases where it says the drive is OK, but a filesystem problem it doesn't test for still makes the setup unreliable.
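For anyone who hasn't run it: diskchecker.pl works by writing and fsyncing blocks on the machine under test while reporting, to a second machine, exactly which blocks the drive has acknowledged as durable; after the power pull, any acknowledged block that's missing or mangled means the storage lied about fsync. The Python sketch below only illustrates that protocol's writer side -- it is not the real tool, and the remote host address and file name are made up for the example:

#!/usr/bin/env python3
"""Conceptual sketch of the diskchecker.pl idea (not the real tool): write
numbered blocks, fsync each one, and only after fsync returns report the
block's number and checksum to another machine over TCP."""
import os, socket, struct, sys, zlib

REMOTE = ("192.0.2.10", 5400)   # hypothetical logging host on another machine
PATH   = "testfile"             # hypothetical test file on the drive under test
BLOCK  = 8192

def run_writer(n_blocks):
    sock = socket.create_connection(REMOTE)
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o644)
    for i in range(n_blocks):
        payload = struct.pack("I", i) * (BLOCK // 4)
        os.pwrite(fd, payload, i * BLOCK)
        os.fsync(fd)                       # drive claims the block is durable now
        crc = zlib.crc32(payload)
        # Only after fsync returns do we tell the remote host about block i.
        sock.sendall(f"{i} {crc}\n".encode())
    os.close(fd)

if __name__ == "__main__":
    run_writer(int(sys.argv[1]) if len(sys.argv) > 1 else 100000)

After the power cycle, the verify phase compares what the remote host logged against what actually survived in the file; the real diskchecker.pl implements both halves.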
What SSD are you using? The Intel 320 and 710 series are the only SATA-connected drives still on the market that I know of which pass a serious test. The other good models are direct PCI-E storage units, like the FusionIO drives.
--
Greg Smith   2ndQuadrant US   greg@xxxxxxxxxxxxxxx   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support   www.2ndQuadrant.com