Problem with 9.1 streaming replication

Georges Racinet <gracinet@xxxxxxxxx> · Mon, 23 Jul 2012 14:09:32 +0200

Hi all.

While testing a replication setup with PostgreSQL 9.1.4, I get an
error after promoting the slave to master: a file under the 'base'
subdirectory cannot be read, because only 0 bytes could be fetched (see
the log extract at the end). Indeed, the actual file size is 0.
I believe that whatever configuration mistake I may have made, such
corruption should never happen, should it?
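
For anyone who wants to repeat the file-size check, here is a minimal
Python sketch. It assumes the default 8192-byte block size; the function
name and the example data directory (Debian's usual location for a 9.1
cluster named "main") are illustrative assumptions, not something taken
from my setup. Note that a zero-length relation file is not wrong by
itself (an empty table has one); it is only a problem when the server
expects block 0 to exist.

```python
import os
import sys

BLOCK_SIZE = 8192  # PostgreSQL's default page size (BLCKSZ); an assumption here


def describe_relation_file(data_dir, rel_path):
    """Report the size of a relation file and how many whole blocks it holds."""
    full_path = os.path.join(data_dir, rel_path)
    size = os.path.getsize(full_path)
    return {
        "path": full_path,
        "size_bytes": size,
        "whole_blocks": size // BLOCK_SIZE,
        "trailing_bytes": size % BLOCK_SIZE,
        # Block 0 can only be read if at least one full block is on disk.
        "can_read_block_0": size >= BLOCK_SIZE,
    }


if __name__ == "__main__":
    # Hypothetical usage:
    #   python check_file.py /var/lib/postgresql/9.1/main base/142824/151268
    if len(sys.argv) == 3:
        for key, value in describe_relation_file(sys.argv[1], sys.argv[2]).items():
            print("%s: %s" % (key, value))
    else:
        print("usage: check_file.py DATA_DIR REL_PATH")
```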

That error persists across cluster restarts. Basically, the DB is
corrupted and almost nothing works. The only option is to rebuild it
from a dump.

The replication itself works: I'm able to start it with pg_basebackup in
both directions.

I thought for a while that the error happened because I had made the
mistake of not configuring wal_keep_segments (I didn't realize the
default value was not merely small but actually zero). Is that a
realistic explanation?

Since those first attempts, I have set it to a value that I believe to
be generous (1024, which should mean 16 GB of WAL). After that, I had a
successful failover simulation.
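
The arithmetic behind that figure, as a small Python sketch. It assumes
the default 16 MB WAL segment size; the function names and the WAL
generation rate passed to the second function are hypothetical, for
illustration only.

```python
WAL_SEGMENT_BYTES = 16 * 1024 * 1024  # default WAL segment size (16 MiB)


def wal_retention_bytes(wal_keep_segments):
    """Minimum amount of past WAL kept for standbys, in bytes."""
    return wal_keep_segments * WAL_SEGMENT_BYTES


def seconds_of_lag_covered(wal_keep_segments, wal_bytes_per_second):
    """Rough time a standby may fall behind before segments it needs are gone.

    wal_bytes_per_second is a hypothetical, measured WAL generation rate.
    """
    if wal_bytes_per_second <= 0:
        raise ValueError("wal_bytes_per_second must be positive")
    return wal_retention_bytes(wal_keep_segments) / wal_bytes_per_second


if __name__ == "__main__":
    kept = wal_retention_bytes(1024)
    print("wal_keep_segments = 1024 keeps %.0f GiB of WAL" % (kept / 1024 ** 3))
    print("wal_keep_segments = 0 keeps %d bytes" % wal_retention_bytes(0))
```

With the default of zero, nothing extra is retained, so a standby that
falls behind can only catch up from a WAL archive or a fresh base backup.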

But the error came back yesterday, with the same fatal corruption
symptoms. It seems to be correlated with the size of the data being
replicated. This time, it happened right after a pg_restore (dumps in
custom format are around 50 MB).

The bandwidth between the servers is quite sufficient: I have seen up to
70 MB/s with rsync.

Promotion is done with Debian's pg_ctlcluster promote, which I believe
to be, like other Debian tools, a wrapper that selects the right cluster.
Application software starts after the promotion.

Any hint appreciated, thanks!

Precise version:  9.1.4-2~bpo60+1 from Debian squeeze-backports

Log extract (French locale, translated to English here):
2012-07-22 21:27:59 UTC LOG:  archive recovery complete
2012-07-22 21:27:59 UTC LOG:  database system is ready to accept connections
2012-07-22 21:27:59 UTC LOG:  autovacuum launcher started
2012-07-22 21:30:19 UTC ERROR:  could not read block 0 in file "base/142824/151268": read only 0 of 8192 bytes
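
The path in the error message encodes which object is affected:
base/<database OID>/<relfilenode>, optionally followed by a fork suffix
(_fsm, _vm) and a segment number. Below is a rough Python sketch that
pulls those parts out of a log line and prints catalog queries that
could identify the relation. The function names are made up for
illustration, and the queries assume the catalogs of the affected
database are still readable, which may not hold on a damaged cluster.

```python
import re

# base/<database oid>/<relfilenode>[_fsm|_vm][.<segment number>]
REL_PATH = re.compile(r"base/(\d+)/(\d+)(?:_(fsm|vm))?(?:\.(\d+))?")


def parse_relation_path(message):
    """Extract database OID, relfilenode, fork and segment from a log message."""
    match = REL_PATH.search(message)
    if match is None:
        return None
    return {
        "database_oid": int(match.group(1)),
        "relfilenode": int(match.group(2)),
        "fork": match.group(3) or "main",
        "segment": int(match.group(4) or 0),
    }


def lookup_queries(info):
    """Build SQL to run by hand; the second query must be run while
    connected to the database returned by the first one."""
    return [
        "SELECT datname FROM pg_database WHERE oid = %d;" % info["database_oid"],
        "SELECT oid::regclass, relkind FROM pg_class WHERE relfilenode = %d;"
        % info["relfilenode"],
    ]


if __name__ == "__main__":
    line = 'could not read block 0 in file "base/142824/151268"'
    for query in lookup_queries(parse_relation_path(line)):
        print(query)
```

If the second query returns nothing, the file may belong to a relation
whose pg_class.relfilenode is zero (some system catalogs), or to an
object that no longer exists.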

--
Georges Racinet
Anybox SAS, http://anybox.fr
Office: 09 53 53 72 97 Mobile: 06 51 32 07 27
GPG: 0x33AB0A35, on public key servers

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)