Hello, world! I'm trying to figure out how to use fail-over with 2 PgSQL 9.1 servers; I already succeeded in exchanging roles (ie. making standby master and vice versa), but only when both master and slave are up and running. I tried to simulate a master crash by killing its virtual machine, but, when I try to promote the standby server, even if /usr/lib/postgresql/9.1/bin/pg_ctl promote -D /var/lib/postgresql/9.1/main/ answers that promote is in progress, it looks like the standby tries to connect to the no-longer-running master to get some WAL files (despite the streaming replication; I noticed it when I saw PgSQL launching SCP attempts on the crashed master), and, as the master no longer runs, fails and stop. Here is the corresponding (CSV) log output : 2013-04-02 15:13:56.983 CEST,,,3523,"[local]",515ad994.dc3,1,"",2013-04-02 15:13:56 CEST,,0,LOG,08P01,"incomplete startup packet",,,,,,,,"ProcessStartupPacket, postmaster.c:1604","" 2013-04-02 15:13:56.985 CEST,,,3522,,515ad994.dc2,1,,2013-04-02 15:13:56 CEST,,0,LOG,00000,"database system was interrupted; last known up at 2013-04-02 14:46:17 CEST",,,,,,,,"StartupXLOG, xlog.c:6098","" 2013-04-02 15:13:56.985 CEST,,,3522,,515ad994.dc2,2,,2013-04-02 15:13:56 CEST,,0,LOG,00000,"creating missing WAL directory ""pg_xlog/archive_status""",,,,,,,,"ValidateXLOGDirectoryStructure, xlog.c:3514","" 2013-04-02 15:13:57.589 CEST,"postgres","postgres",3529,"[local]",515ad995.dc9,1,"",2013-04-02 15:13:57 CEST,,0,FATAL,57P03,"the database system is starting up",,,,,,,,"ProcessStartupPacket, postmaster.c:1858","" 2013-04-02 15:13:57.842 CEST,,,3522,,515ad994.dc2,3,,2013-04-02 15:13:56 CEST,,0,LOG,00000,"entering standby mode",,,,,,,,"StartupXLOG, xlog.c:6164","" 2013-04-02 15:13:58.106 CEST,"postgres","postgres",3536,"[local]",515ad996.dd0,1,"",2013-04-02 15:13:58 CEST,,0,FATAL,57P03,"the database system is starting up",,,,,,,,"ProcessStartupPacket, postmaster.c:1858","" 2013-04-02 15:13:58.620 CEST,"postgres","postgres",3539,"[local]",515ad996.dd3,1,"",2013-04-02 15:13:58 CEST,,0,FATAL,57P03,"the database system is starting up",,,,,,,,"ProcessStartupPacket, postmaster.c:1858","" 2013-04-02 15:13:58.663 CEST,,,3522,,515ad994.dc2,4,,2013-04-02 15:13:56 CEST,,0,LOG,00000,"restored log file ""0000000200000001000000D5"" from archive",,,,,,,,"RestoreArchivedFile, xlog.c:3082","" 2013-04-02 15:13:59.005 CEST,,,3522,,515ad994.dc2,5,,2013-04-02 15:13:56 CEST,1/1,0,LOG,00000,"redo starts at 1/D5000020",,,,,,,,"StartupXLOG, xlog.c:6563","" 2013-04-02 15:13:59.006 CEST,,,3522,,515ad994.dc2,6,,2013-04-02 15:13:56 CEST,1/1,0,LOG,00000,"consistent recovery state reached at 1/D6000000",,,,,,,,"CheckRecoveryConsistency, xlog.c:7023","" 2013-04-02 15:13:59.006 CEST,,,3520,,515ad992.dc0,1,,2013-04-02 15:13:54 CEST,,0,LOG,00000,"database system is ready to accept read only connections",,,,,,,,"sigusr1_handler, postmaster.c:4268","" 2013-04-02 15:13:59.493 CEST,,,3548,,515ad997.ddc,1,,2013-04-02 15:13:59 CEST,,0,LOG,00000,"streaming replication successfully connected to primary",,,,,,,,"libpqrcv_connect, libpqwalreceiver.c:171","" 2013-04-02 15:15:46.940 CEST,,,3522,,515ad994.dc2,7,,2013-04-02 15:13:56 CEST,1/1,0,LOG,00000,"received promote request",,,,,,,,"CheckForStandbyTrigger, xlog.c:10578","" 2013-04-02 15:15:46.940 CEST,,,3548,,515ad997.ddc,2,,2013-04-02 15:13:59 CEST,,0,FATAL,57P01,"terminating walreceiver process due to administrator command",,,,,,,,"ProcessWalRcvInterrupts, walreceiver.c:150","" 2013-04-02 15:16:08.044 CEST,,,3522,,515ad994.dc2,8,,2013-04-02 15:13:56 CEST,1/1,0,FATAL,XX000,"could not restore file ""0000000200000001000000D6"" from archive: return code 65280",,,,,,,,"RestoreArchivedFile, xlog.c:3132","" 2013-04-02 15:16:08.044 CEST,,,3520,,515ad992.dc0,2,,2013-04-02 15:13:54 CEST,,0,LOG,00000,"startup process (PID 3522) exited with exit code 1",,,,,,,,"LogChildExit, postmaster.c:2878","" 2013-04-02 15:16:08.044 CEST,,,3520,,515ad992.dc0,3,,2013-04-02 15:13:54 CEST,,0,LOG,00000,"terminating any other active server processes",,,,,,,,"HandleChildCrash, postmaster.c:2712","" As you can see, it tries to download an archive file from the old master, but, obviously, it can't as the master has stopped. Here is my recovery.conf : standby_mode = 'on' primary_conninfo = 'host=192.168.0.80 port=5433 user=replika password=replika' restore_command = 'rsync -av 192.168.0.80:/var/lib/postgresql/WAL/%f %p' And my postgresql.conf (only the non-default values, same config on both servers; note that all locales but lc_messages are in French because I'm French and so are the servers [no froggies jokes, please ;)], but the problem remains even if locales are all set to fr_FR.UTF-8; lc_messages is set to en_GB.UTF-8 to allow me to send this mesage with English logs) : data_directory = '/var/lib/postgresql/9.1/main' hba_file = '/etc/postgresql/9.1/main/pg_hba.conf' ident_file = '/etc/postgresql/9.1/main/pg_ident.conf' external_pid_file = '/var/run/postgresql/9.1-main.pid' listen_addresses = '*' port = 5433 max_connections = 10 unix_socket_directory = '/var/run/postgresql' ssl = true shared_buffers = 24MB wal_level = hot_standby wal_buffers = 8MB checkpoint_segments = 32 checkpoint_completion_target = 0.9 archive_mode = on archive_command = 'test ! -f /var/lib/postgresql/WAL/%f && cp %p /var/lib/postgresql/WAL/%f' max_wal_senders = 5 hot_standby = on log_destination = 'csvlog' logging_collector = on log_directory = '/var/log/postgresql' log_filename = 'postgresql-requests.log' log_min_messages = info log_error_verbosity = verbose log_line_prefix = '%t ' lc_messages = 'en_GB.UTF-8' lc_monetary = 'fr_FR.UTF-8' lc_numeric = 'fr_FR.UTF-8' lc_time = 'fr_FR.UTF-8' default_text_search_config = 'pg_catalog.french' I assume I missed something there because, as this is a replication mechanism, in my opinion, the standby should simply be promoted after master failure, or at least I should have the possibility to ask this to the standby while asking not to connect to the old master and make its own way with the local data and whose the replication stream already sent. Or is it the standby promotion procedure in case of master failure which misses something? I can't tell because the documentation isn't verbose about the standby promotion after a master failure (or I wasn't able to find it by myself). If there is a detailed procedure available for the event of a master crash, where is it? Else, where is my error? Thank you in advance. Regards. -- David Guyot Sysadmin Europe Camions Interactive Moulin Collot F-88500 Ambacourt Tel: +33 (0)3 29 30 47 85 Fax : +33 (0)3 29 31 31 31 -- Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-admin