I'm trying to use the capability described at the end of the first paragraph of this section of the documentation (http://www.postgresql.org/docs/9.1/static/warm-standby.html#STANDBY-SERVER-SETUP), specifically: "If you plan to have multiple standby servers for high availability purposes, set recovery_target_timeline to latest, to make the standby server follow the timeline change that occurs at failover to another standby."
To test this, I've set up three PostgreSQL machines: a master (192.168.56.5) and two PITR slaves (192.168.56.6 and 192.168.56.7), both connected to the master. Below, the hostnames stay fixed even if a slave is promoted to master, so "slave 1" always refers to 192.168.56.6. I've compiled and installed PostgreSQL 9.1.2, plus pgbench for testing purposes. I've made the following changes to postgresql.conf (for ease of testing, the same postgresql.conf is used on all servers):
wal_level = hot_standby
archive_mode = off
max_wal_senders = 3
wal_keep_segments = 100
hot_standby = on
hot_standby_feedback = on
And on the slaves, I've got this for recovery.conf:
standby_mode = 'on'
primary_conninfo = 'host=192.168.56.5 port=5432 user=postgres'
trigger_file = '/pgsql/omnipitr/finish.recovery'
recovery_target_timeline = latest
I've initialized a fresh DB on master, then executed the following to create fresh slave copies (with .7 instead of .6 to create slave 2):
pg_start_backup('start');
rsync -avh --exclude=pg_log --exclude=postgresql.conf --exclude=postmaster.pid . 192.168.56.6:/pgsql/91/data
pg_stop_backup();
rsync -avh --exclude=pg_log --exclude=postgresql.conf --exclude=postmaster.pid . 192.168.56.6:/pgsql/91/data
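For repeatability, the backup-mode calls and the two rsync passes can be wrapped in one small script. This is only a sketch of the procedure above; the run()/DRY_RUN wrapper is my addition so the commands can be previewed first, and it assumes local psql access on the master and passwordless rsync to the slave:

```shell
#!/bin/sh
# Sketch of the base-backup procedure above. With DRY_RUN=1 (the
# default here) the commands are only printed; unset it to execute.
DRY_RUN="${DRY_RUN-1}"
SLAVE="${1:-192.168.56.6}"     # use 192.168.56.7 for slave 2
DATADIR="/pgsql/91/data"
EXCLUDES="--exclude=pg_log --exclude=postgresql.conf --exclude=postmaster.pid"

run() { if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi; }

run psql -U postgres -c "SELECT pg_start_backup('start');"
run rsync -avh $EXCLUDES "$DATADIR/" "$SLAVE:$DATADIR"
run psql -U postgres -c "SELECT pg_stop_backup();"
run rsync -avh $EXCLUDES "$DATADIR/" "$SLAVE:$DATADIR"
```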
Both slaves were brought online, and to prove that streaming replication was working, I created a table and inserted some simple data into it. The changes showed up on the slaves, so the setup appears good so far. To generate further data on the master, I used pgbench (time /opt/pgsql/bin/pgbench -i -s 35 test). In the middle of the pgbench data load (I let it load more than 1 million tuples first), I killed the postgresql and pgbench processes on the master:
root@pg912-master:~# ps -ef |grep post
postgres 1955 1 0 11:39 pts/0 00:00:00 /opt/pgsql/bin/postmaster -D /pgsql/91/data
postgres 1956 1955 0 11:39 ? 00:00:00 postgres: logger process
postgres 1958 1955 0 11:39 ? 00:00:00 postgres: writer process
postgres 1959 1955 0 11:39 ? 00:00:00 postgres: wal writer process
postgres 1960 1955 0 11:39 ? 00:00:00 postgres: autovacuum launcher process
postgres 1961 1955 0 11:39 ? 00:00:00 postgres: stats collector process
postgres 2013 1955 1 11:50 ? 00:00:11 postgres: wal sender process postgres 192.168.56.6(35184) streaming 0/2B000000
postgres 2015 1955 0 11:50 ? 00:00:02 postgres: wal sender process postgres 192.168.56.7(44052) streaming 0/2B000000
postgres 2049 1709 1 12:05 pts/1 00:00:00 /opt/pgsql/bin/pgbench -i -s 35 test
postgres 2050 1955 12 12:05 ? 00:00:00 postgres: postgres test [local] COPY
root 2052 1602 0 12:05 pts/0 00:00:00 grep post
root@pg912-master:~# kill -9 1955 1956 1958 1959 1960 1961 2013 2015 2049 2050
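Besides inserting test rows, a quicker check that both standbys are streaming is to query pg_stat_replication on the master (the view is available from 9.1 on). The host and user below are from my setup; the DRY_RUN wrapper is my addition so the snippet only prints the command by default:

```shell
# Ask the master which standbys are connected; both slaves should be
# in state 'streaming'. DRY_RUN=1 (the default here) only prints the
# command; unset it to run against the master for real.
DRY_RUN="${DRY_RUN-1}"
run() { if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi; }

run psql -h 192.168.56.5 -U postgres -x \
    -c "SELECT client_addr, state, sent_location, replay_location FROM pg_stat_replication;"
```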
After the master terminated, I promoted slave 1 to become the new master by touching the trigger file (touch /pgsql/omnipitr/finish.recovery). Some relevant snippets from slave 1's log file:
2011-12-15 12:05:56.191 EST 2580:LOG: recovery restart point at 0/2D0BE0B0
2011-12-15 12:05:56.191 EST 2580:DETAIL: last completed transaction was at log time 2011-12-15 12:05:28.868118-05
2011-12-15 12:05:57.031 EST 2580:LOG: restartpoint starting: xlog
2011-12-15 12:06:03.067 EST 2580:LOG: restartpoint complete: wrote 98 buffers (3.2%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=4.907 s, sync=1.101 s, total=6.035 s; sync files=2, longest=1.070 s, average=0.550 s
2011-12-15 12:06:03.067 EST 2580:LOG: recovery restart point at 0/300BE0B0
2011-12-15 12:06:03.067 EST 2580:DETAIL: last completed transaction was at log time 2011-12-15 12:05:28.868118-05
2011-12-15 12:06:05.148 EST 2581:FATAL: could not receive data from WAL stream:
2011-12-15 12:06:09.769 EST 2579:LOG: invalid record length at 0/33C92290
2011-12-15 12:06:09.938 EST 2608:FATAL: could not connect to the primary server: could not connect to server: Connection refused
Is the server running on host "192.168.56.5" and accepting
TCP/IP connections on port 5432?
... more of the same primary server connection refused error ...
2011-12-15 12:08:09.892 EST 2579:LOG: trigger file found: /pgsql/omnipitr/finish.recovery
2011-12-15 12:08:09.892 EST 2579:LOG: redo done at 0/33C921F0
2011-12-15 12:08:09.892 EST 2579:LOG: last completed transaction was at log time 2011-12-15 12:05:28.868118-05
2011-12-15 12:08:09.892 EST 2579:LOG: selected new timeline ID: 2
2011-12-15 12:08:10.078 EST 2579:LOG: archive recovery complete
2011-12-15 12:08:10.491 EST 2580:LOG: restartpoint complete: wrote 2692 buffers (87.6%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=112.051 s, sync=0.274 s, total=112.361 s; sync files=2, longest=0.273 s, average=0.137 s
2011-12-15 12:08:10.492 EST 2580:LOG: recovery restart point at 0/330BE0B0
2011-12-15 12:08:10.492 EST 2580:DETAIL: last completed transaction was at log time 2011-12-15 12:05:28.868118-05
2011-12-15 12:08:10.493 EST 2580:LOG: checkpoint starting: end-of-recovery immediate wait
2011-12-15 12:08:10.497 EST 2580:LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.005 s; sync files=0, longest=0.000 s, average=0.000 s
2011-12-15 12:08:10.951 EST 2651:LOG: autovacuum launcher started
2011-12-15 12:08:10.960 EST 2577:LOG: database system is ready to accept connections
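The promotion above (the trigger-file touch) plus a couple of sanity checks can be scripted. Again just a sketch with paths from my setup; after promotion, pg_is_in_recovery() should return 'f' and pg_controldata should report the new timeline (2, per the log above):

```shell
#!/bin/sh
# Promote slave 1 and verify: touch the trigger file, then confirm the
# server left recovery and check the timeline in the control file.
# Paths are from my setup; DRY_RUN=1 (default) only prints commands.
DRY_RUN="${DRY_RUN-1}"
run() { if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi; }

run touch /pgsql/omnipitr/finish.recovery
run sleep 5                                            # give recovery time to finish
run psql -U postgres -c "SELECT pg_is_in_recovery();"  # expect 'f' once promoted
run /opt/pgsql/bin/pg_controldata /pgsql/91/data       # look for the new TimeLineID
```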
Slave 2 was still attempting to connect to the old master, so I modified its recovery.conf to point at slave 1 instead of the master, as below, and restarted PostgreSQL on slave 2:
standby_mode = 'on'
primary_conninfo = 'host=192.168.56.6 port=5432 user=postgres'
trigger_file = '/pgsql/omnipitr/finish.recovery'
recovery_target_timeline = latest
Relevant log snippets from slave 2:
2011-12-15 12:05:54.992 EST 2626:LOG: recovery restart point at 0/2D0BE0B0
2011-12-15 12:05:54.992 EST 2626:DETAIL: last completed transaction was at log time 2011-12-15 12:05:28.868118-05
2011-12-15 12:05:56.295 EST 2626:LOG: restartpoint starting: xlog
2011-12-15 12:06:05.979 EST 2627:FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2011-12-15 12:06:06.191 EST 2625:LOG: unexpected pageaddr 0/28E40000 in log file 0, segment 50, offset 14942208
2011-12-15 12:06:06.218 EST 2626:LOG: restartpoint complete: wrote 185 buffers (6.0%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=9.692 s, sync=0.222 s, total=9.923 s; sync files=2, longest=0.204 s, average=0.111 s
2011-12-15 12:06:06.218 EST 2626:LOG: recovery restart point at 0/300BE0B0
2011-12-15 12:06:06.218 EST 2626:DETAIL: last completed transaction was at log time 2011-12-15 12:05:28.868118-05
2011-12-15 12:06:21.419 EST 2652:FATAL: could not connect to the primary server: could not connect to server: Connection refused
Is the server running on host "192.168.56.5" and accepting
TCP/IP connections on port 5432?
... more of the same primary server connection refused error ...
2011-12-15 12:08:44.924 EST 2623:LOG: received fast shutdown request
2011-12-15 12:08:44.935 EST 2623:LOG: aborting any active transactions
2011-12-15 12:08:44.935 EST 2626:LOG: shutting down
2011-12-15 12:08:44.945 EST 2626:LOG: database system is shut down
2011-12-15 12:08:46.802 EST 2737:LOG: database system was shut down in recovery at 2011-12-15 12:08:44 EST
2011-12-15 12:08:46.804 EST 2737:LOG: entering standby mode
2011-12-15 12:08:46.818 EST 2737:LOG: redo starts at 0/300BE0B0
2011-12-15 12:08:47.313 EST 2737:LOG: consistent recovery state reached at 0/32E3FFF0
2011-12-15 12:08:47.313 EST 2735:LOG: database system is ready to accept read only connections
2011-12-15 12:08:47.313 EST 2737:LOG: unexpected pageaddr 0/28E40000 in log file 0, segment 50, offset 14942208
2011-12-15 12:08:47.437 EST 2740:FATAL: timeline 2 of the primary does not match recovery target timeline 1
2011-12-15 12:08:52.323 EST 2742:FATAL: timeline 2 of the primary does not match recovery target timeline 1
... repeated continuously ...
Any thoughts on the above snippets? Am I interpreting the documentation correctly? Is there any further information needed to debug this?
Thanks,
Rick