Hot standby with streaming replication under PgSQL 9.1.x : failover when master crashes?

"David Guyot (English)" <david.guyot@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx> · Tue, 02 Apr 2013 16:29:34 +0200

Hello, world!

I'm trying to figure out how to use fail-over with 2 PgSQL 9.1 servers;
I already succeeded in exchanging roles (ie. making standby master and
vice versa), but only when both master and slave are up and running. I
tried to simulate a master crash by killing its virtual machine, but,
when I try to promote the standby server, even if
/usr/lib/postgresql/9.1/bin/pg_ctl promote -D
/var/lib/postgresql/9.1/main/ answers that promote is in progress, it
looks like the standby tries to connect to the no-longer-running master
to get some WAL files (despite the streaming replication; I noticed it
when I saw PgSQL launching SCP attempts on the crashed master), and, as
the master no longer runs, fails and stop. Here is the corresponding
(CSV) log output :

2013-04-02 15:13:56.983
CEST,,,3523,"[local]",515ad994.dc3,1,"",2013-04-02 15:13:56
CEST,,0,LOG,08P01,"incomplete startup
packet",,,,,,,,"ProcessStartupPacket, postmaster.c:1604",""
2013-04-02 15:13:56.985 CEST,,,3522,,515ad994.dc2,1,,2013-04-02 15:13:56
CEST,,0,LOG,00000,"database system was interrupted; last known up at
2013-04-02 14:46:17 CEST",,,,,,,,"StartupXLOG, xlog.c:6098",""
2013-04-02 15:13:56.985 CEST,,,3522,,515ad994.dc2,2,,2013-04-02 15:13:56
CEST,,0,LOG,00000,"creating missing WAL directory
""pg_xlog/archive_status""",,,,,,,,"ValidateXLOGDirectoryStructure,
xlog.c:3514",""
2013-04-02 15:13:57.589
CEST,"postgres","postgres",3529,"[local]",515ad995.dc9,1,"",2013-04-02
15:13:57 CEST,,0,FATAL,57P03,"the database system is starting
up",,,,,,,,"ProcessStartupPacket, postmaster.c:1858",""
2013-04-02 15:13:57.842 CEST,,,3522,,515ad994.dc2,3,,2013-04-02 15:13:56
CEST,,0,LOG,00000,"entering standby mode",,,,,,,,"StartupXLOG,
xlog.c:6164",""
2013-04-02 15:13:58.106
CEST,"postgres","postgres",3536,"[local]",515ad996.dd0,1,"",2013-04-02
15:13:58 CEST,,0,FATAL,57P03,"the database system is starting
up",,,,,,,,"ProcessStartupPacket, postmaster.c:1858",""
2013-04-02 15:13:58.620
CEST,"postgres","postgres",3539,"[local]",515ad996.dd3,1,"",2013-04-02
15:13:58 CEST,,0,FATAL,57P03,"the database system is starting
up",,,,,,,,"ProcessStartupPacket, postmaster.c:1858",""
2013-04-02 15:13:58.663 CEST,,,3522,,515ad994.dc2,4,,2013-04-02 15:13:56
CEST,,0,LOG,00000,"restored log file ""0000000200000001000000D5"" from
archive",,,,,,,,"RestoreArchivedFile, xlog.c:3082",""
2013-04-02 15:13:59.005 CEST,,,3522,,515ad994.dc2,5,,2013-04-02 15:13:56
CEST,1/1,0,LOG,00000,"redo starts at 1/D5000020",,,,,,,,"StartupXLOG,
xlog.c:6563",""
2013-04-02 15:13:59.006 CEST,,,3522,,515ad994.dc2,6,,2013-04-02 15:13:56
CEST,1/1,0,LOG,00000,"consistent recovery state reached at
1/D6000000",,,,,,,,"CheckRecoveryConsistency, xlog.c:7023",""
2013-04-02 15:13:59.006 CEST,,,3520,,515ad992.dc0,1,,2013-04-02 15:13:54
CEST,,0,LOG,00000,"database system is ready to accept read only
connections",,,,,,,,"sigusr1_handler, postmaster.c:4268",""
2013-04-02 15:13:59.493 CEST,,,3548,,515ad997.ddc,1,,2013-04-02 15:13:59
CEST,,0,LOG,00000,"streaming replication successfully connected to
primary",,,,,,,,"libpqrcv_connect, libpqwalreceiver.c:171",""
2013-04-02 15:15:46.940 CEST,,,3522,,515ad994.dc2,7,,2013-04-02 15:13:56
CEST,1/1,0,LOG,00000,"received promote
request",,,,,,,,"CheckForStandbyTrigger, xlog.c:10578",""
2013-04-02 15:15:46.940 CEST,,,3548,,515ad997.ddc,2,,2013-04-02 15:13:59
CEST,,0,FATAL,57P01,"terminating walreceiver process due to
administrator command",,,,,,,,"ProcessWalRcvInterrupts,
walreceiver.c:150",""
2013-04-02 15:16:08.044 CEST,,,3522,,515ad994.dc2,8,,2013-04-02 15:13:56
CEST,1/1,0,FATAL,XX000,"could not restore file
""0000000200000001000000D6"" from archive: return code
65280",,,,,,,,"RestoreArchivedFile, xlog.c:3132",""
2013-04-02 15:16:08.044 CEST,,,3520,,515ad992.dc0,2,,2013-04-02 15:13:54
CEST,,0,LOG,00000,"startup process (PID 3522) exited with exit code
1",,,,,,,,"LogChildExit, postmaster.c:2878",""
2013-04-02 15:16:08.044 CEST,,,3520,,515ad992.dc0,3,,2013-04-02 15:13:54
CEST,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,"HandleChildCrash, postmaster.c:2712",""

As you can see, it tries to download an archive file from the old
master, but, obviously, it can't as the master has stopped.

Here is my recovery.conf :

standby_mode = 'on'
primary_conninfo = 'host=192.168.0.80 port=5433 user=replika
password=replika'
restore_command = 'rsync -av 192.168.0.80:/var/lib/postgresql/WAL/%f %p'

And my postgresql.conf (only the non-default values, same config on both
servers; note that all locales but lc_messages are in French because I'm
French and so are the servers [no froggies jokes, please ;)], but the
problem remains even if locales are all set to fr_FR.UTF-8; lc_messages
is set to en_GB.UTF-8 to allow me to send this mesage with English logs) :

data_directory = '/var/lib/postgresql/9.1/main'
hba_file = '/etc/postgresql/9.1/main/pg_hba.conf'
ident_file = '/etc/postgresql/9.1/main/pg_ident.conf'
external_pid_file = '/var/run/postgresql/9.1-main.pid'
listen_addresses = '*'
port = 5433
max_connections = 10
unix_socket_directory = '/var/run/postgresql'
ssl = true
shared_buffers = 24MB
wal_level = hot_standby
wal_buffers = 8MB
checkpoint_segments = 32
checkpoint_completion_target = 0.9
archive_mode = on
archive_command = 'test ! -f /var/lib/postgresql/WAL/%f && cp %p
/var/lib/postgresql/WAL/%f'
max_wal_senders = 5
hot_standby = on
log_destination = 'csvlog'
logging_collector = on
log_directory = '/var/log/postgresql'
log_filename = 'postgresql-requests.log'
log_min_messages = info
log_error_verbosity = verbose
log_line_prefix = '%t '
lc_messages = 'en_GB.UTF-8'
lc_monetary = 'fr_FR.UTF-8'
lc_numeric = 'fr_FR.UTF-8'
lc_time = 'fr_FR.UTF-8'
default_text_search_config = 'pg_catalog.french'

I assume I missed something there because, as this is a replication
mechanism, in my opinion, the standby should simply be promoted after
master failure, or at least I should have the possibility to ask this to
the standby while asking not to connect to the old master and make its
own way with the local data and whose the replication stream already
sent. Or is it the standby promotion procedure in case of master failure
which misses something? I can't tell because the documentation isn't
verbose about the standby promotion after a master failure (or I wasn't
able to find it by myself). If there is a detailed procedure available
for the event of a master crash, where is it? Else, where is my error?

Thank you in advance.

Regards.

-- 
David Guyot
Sysadmin
Europe Camions Interactive
Moulin Collot
F-88500 Ambacourt

Tel: +33 (0)3 29 30 47 85
Fax : +33 (0)3 29 31 31 31

-- 
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin