Re: Possible causes for database corruption and solutions

Michael Clark <codingninja@xxxxxxxxx> · Wed, 16 Dec 2009 08:19:37 -0500

Hello Craig - thanks for the reply.  I will reply below.

On Tue, Dec 15, 2009 at 9:34 PM, Craig Ringer <craig@xxxxxxxxxxxxxxxxxxxxx> wrote:

On 16/12/2009 6:39 AM, Michael Clark wrote:

Hello all,

Over the past 6 months or so I have posted to the list a couple times

looking for information regarding recovering databases from corruption.

  At the time the incidents of corruption among our users was very low,

but the frequency is starting to increase, most likely due to the

increased user base that has upgraded to the version of our software

that uses Postgres.

I have a couple questions I am hoping to get some feedback on.

Secondly, I ask about an alternative solution to the corruption problem

because with preliminary testing we have seen a significant degradation

in performance.

>From changing to fsync_writethrough ? That's a good thing - it suggests that maybe now the data is actually hitting disk when Pg asks it to.

You can have fast(er), or safe, but not both. Now that your database is actually doing what it should be and truthfully promising that data has hit disk when you commit, you may have to adopt some strategies to reduce the number of very short repetitive transactions you perform.

That is true, this does bode well for us finding our cause.  And that old axiom does often ring true.

( Should Pg perhaps detect OS X and switch the default to fsync_writethrough ? Or are the "test to see if fsync() works on startup" plans going anywhere? )

Not sure if you were directing that to me specifically, I feel probably to the general PG public.
But, this is probably a lot less of a problem in a proper server environment, which most of our users are not in. (I touch more on that to in my reply to you and Scott).
For example, we run our software on a Mac Pro, with raid 1 and the machine is protected by UPS.  We have never had an issue ourselves.
I guess it depends on the majority use case for PG on OS X whether this should be defaulted to on.
(Although siding with caution usually never hurts when data is concerned.)

I then restore the dump into a newly created DB (createdb -E UTF8 Test1)

with the command:

pg_restore -F c -d Test1 pgbackup

Make sure to do the restore in a single transaction. It will be *enormously*, *massively* faster. See the "--single-transaction" option to pg_restore.

Thanks for that tip, cut the restore in half in this particular test.

If you can't do that, then set a commit_delay so that PostgreSQL can batch the fsync()s for commmits together. See postgresql.conf and the documentation for commit_delay and commit_siblings.

I will look those up as well, thanks.

I am hoping there may be an alternative to fsync_writethrough and/or we

are barking up the wrong tree w.r.t. the cause of the corruptions.

If fsync_writethrough slowed things down that much then you've almost certainly nailed the cause of the corruptions. Now you just need to tune your DB, and adapt how you use the DB, so that you're less affected by the necessary performance hit imposed by safe and reliable use of disk storage.

It definitely feels better knowing that we likely found our issue, and you are right, time to tune as best we can.
Luckily these sort of large operations are generally the smaller percentage of use cases.

Thanks again for the reply,
Michael

--

Craig Ringer