Re: Plug-pull testing worked, diskchecker.pl failed

Scott Marlowe <scott.marlowe@xxxxxxxxx> · Wed, 24 Oct 2012 10:18:53 -0600

On Wed, Oct 24, 2012 at 8:04 AM, Chris Angelico <rosuav@xxxxxxxxx> wrote:
> On Tue, Oct 23, 2012 at 9:51 AM, Scott Marlowe <scott.marlowe@xxxxxxxxx> wrote:
>> On Mon, Oct 22, 2012 at 7:17 AM, Chris Angelico <rosuav@xxxxxxxxx> wrote:
>>> After reading the comments last week about SSDs, I did some testing of
>>> the ones we have at work - each of my test-boxes (three with SSDs, one
>>> with HDD) subjected to multiple stand-alone plug-pull tests, using
>>> pgbench to provide load. So far, there've been no instances of
>>> PostgreSQL data corruption, but diskchecker.pl reported huge numbers
>>> of errors.
>>
>> Try starting pgbench, then, about halfway through the checkpoint
>> timeout interval, issue a manual checkpoint and pull the plug WHILE
>> that checkpoint is still running.
>>
>> Then after bringing the server up (assuming pg starts up) see if
>> pg_dump generates any errors.
>
> Thanks for the tip. I've been flat-out at work these past few days and
> haven't gotten around to testing in the middle of a checkpoint, but I
> have done something that might also be of interest. It's inspired by a
> combination of diskchecker and pgbench; a harness that puts the
> database under load and retains a record of what's been done.
>
> In brief: Create a table with N (e.g. 100) rows, then spin as fast as
> possible, incrementing a counter against one random row and also
> incrementing the "Total" counter. When the database goes down, wait
> for it to come up again; when it does, check against the local copy of
> the counters and report any discrepancies.
>
> The code's written in Pike, using the same database connection logic
> that we use in our actual application (well, some of our code is C++
> and some is PHP, so this corresponds to one part of our app), so this
> is roughly representative of real usage.
>
> It's about a page or two of code: http://pastebin.com/UNTj642Y

Very cool.  Nice little project.

> Currently, all the key parameters (database connection info (which has
> been censored for the pastebin version), pool size, thread count, etc)
> are just variables visible in the script, simpler than parsing
> command-line arguments.
>
> Is this a useful and plausible testing methodology? It has definitely
> shown up some failures. On a hard disk, all is well as long as the
> write-back cache is disabled; on the SSDs, I can't make them reliable.

Yes, it seems to be quite a good idea, actually.

> Is a single table enough to test for corruption with?

If it fails, definitely; if it passes, maybe.

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)