Re: 8.3.5 broken after power fail SOLVED

Michael Monnerie <michael.monnerie@xxxxxxxxxxxxxxxxxxx> · Sat, 21 Feb 2009 13:58:59 +0100

On Saturday, 21 February 2009, Scott Marlowe wrote:
> We preach this again and again.  PostgreSQL can only survive a power
> outage type failure ONLY if the hardware / OS / filesystem don't lie
> about fsync.  If they do, all bets are off, and this kind of failure
> means you should really failover to another machine or restore a
> backup.

The shit thing is, I discussed with the XFS devs just last week whether
it is safe to run under a virtualization like VMware or XEN, and the
answer was "depends on the hypervisor". I had such an issue with VMware
2 years ago, and now with XEN, so I would say they are not safe. But
there must be something you can configure in order not to have such
drastic errors on power fail. It's just that nobody seems to know (or
want to tell) how to do that. At least, not to me ;-)
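
One rough way to get a hint whether a storage stack honors fsync is to
time a series of small synced writes. The sketch below is only an
illustration: the function names and the threshold are hypothetical, and
it assumes a single rotating disk without a battery-backed write cache,
where the rate of real fsyncs is limited to roughly the order of the
disk's revolutions per second. A much higher rate suggests that some
layer (disk cache, hypervisor, filesystem) acknowledges fsync early. On
SSDs or controllers with protected cache a high rate is legitimate, and
a low rate does not prove safety; only a pull-the-plug test can do that.

```python
import os
import tempfile
import time


def measure_fsync_rate(directory, writes=200, block=b"x" * 8192):
    """Return the number of fsync'd 8 kB writes per second in `directory`."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.write(fd, block)
            os.fsync(fd)  # should block until the data is on stable storage
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return writes / elapsed if elapsed > 0 else float("inf")


def looks_suspicious(rate, rpm=7200):
    """Heuristic only: flag rates far above one sync per disk revolution.

    The factor of 2 is an arbitrary, hypothetical safety margin.
    """
    return rate > (rpm / 60.0) * 2


if __name__ == "__main__":
    # Run this in a directory on the same filesystem as the database.
    rate = measure_fsync_rate(".")
    print("fsync'd writes per second: %.0f" % rate)
    if looks_suspicious(rate):
        print("Faster than a plain rotating disk allows; fsync may be lying.")
```

Run it inside the guest on the filesystem that holds the data directory,
and adjust `rpm` to the actual disks.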

> It's why you have to do possibly destructive tests to see if your
> server stands at least some chance of surviving this kind of failure,
> log shipping for recovery, and / or replication of another form
> (slony etc...) to have a reliable server.

As I need another Postgres setup anyway, with one server syncing dbmail
to another, I guess I'll do that with WAL, so at least then I can
recover up to the latest entry.

> The recommendations for recovery of data are just that, recovery
> oriented.  They can't fix a broken database at that point.  You need
> to take it offline after this kind of failure if you can't trust your
> hardware.
>
> Usually when it finds something wrong it just won't start up.

The problem was that I wasn't working this week, and did just a basic
check whether everything was up again. E-mails were arriving, so I
thought it was OK. I was very pissed when some days later I found
strange things happening, and then saw that a table was broken and had
eaten nearly all e-mails. If only Postgres had whined and stopped
working...

I know it's not Postgres' fault that fsync is messed up, but error
recovery should at least have found the problem, at the latest when the
first transaction touched the problematic table, instead of effectively
throwing the data to /dev/null :-(

Best regards, zmi
-- 
// Michael Monnerie, Ing.BSc    -----      http://it-management.at
// Tel: 0660 / 415 65 31                      .network.your.ideas.
// PGP Key:         "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38  500E CE14 91F7 1C12 09B4
// Keyserver: wwwkeys.eu.pgp.net                  Key-ID: 1C1209B4
