On Mon, Dec 17, 2012 at 02:01:20PM +0200, Heikki Linnakangas wrote: > Hmm, is it possible that some WAL was generated in the old master, > and streamed to the standby, after the new master was already > promoted? It's important to kill the old master before promoting the > new master. Otherwise the timelines diverge, so that you have some > WAL on the old timeline that's not present in the new master, and > some WAL in the new master's timeline that's not present in the old > master. In that situation, if the standby has already replicated the > WAL from the old master, it can no longer start to follow the new > master. I think that would match the symptoms you're seeing. OK. Checked, and can't get it work. Situation: 3 hosts: ubuntu1 (172.28.173.140) ubuntu2 (172.28.173.141) ubuntu3 (172.28.173.142) on all I have 9.3head compiled. Also I have floating IP (172.28.173.253) which is aliased on ubuntu1. On ubuntu1 I make simple Pg initdb, start backup, copy data to ubuntu2 and ubuntu3, and stop backup. than, on both ubuntu2 and 3, I do: rm data/pg_xlog/0* data/pg_xlog/archive_status/* data/pg_log/* data/postmaster.pid ; cp ~/recovery.conf ~/data/ recovery.conf is: standby_mode = 'on' primary_conninfo = 'port=5920 user=replication host=172.28.173.253' trigger_file = '/tmp/finish.replication' recovery_target_timeline='latest' relevant settings in master conf: ----------------- =$ grep -i -E '(wal|checkpoint|archive)' data/postgresql.conf # RESOURCE USAGE (except WAL) wal_level = hot_standby #wal_sync_method = fsync # the default is the first option #wal_buffers = -1 # min 32kB, -1 sets based on shared_buffers #wal_writer_delay = 200ms # 1-10000 milliseconds # - Checkpoints - #checkpoint_segments = 3 # in logfile segments, min 1, 16MB each #checkpoint_timeout = 5min # range 30s-1h #checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0 #checkpoint_warning = 30s # 0 disables archive_mode = on archive_command = '/bin/true' # placeholders: %p = path of file to archive # e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f' #archive_timeout = 0 # force a logfile segment switch after this max_wal_senders = 3 # max number of walsender processes wal_keep_segments = 20 # in logfile segments, 16MB each; 0 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #max_standby_archive_delay = 30s # max delay before canceling queries # when reading WAL from archive; # when reading streaming WAL; #wal_receiver_status_interval = 10s # send replies at least this often #wal_receiver_timeout = 60s # time that receiver waits for log_checkpoints = on ----------------- Afterwards I do: on ubuntu1: pg_ctl -m fast stop sudo ifdown eth0:0 eth0:0 is the interface with shared ip. all works fine. then on ubuntu2 I do: touch /tmp/finish.replication sudo ifup eth0:0 result: ubuntu2 is working OK, but ubuntu3 logs: 2012-12-18 12:33:34.037 CET @ 1513 LOG: database system was interrupted; last known up at 2012-12-18 12:32:47 CET 2012-12-18 12:33:34.037 CET @ 1513 LOG: entering standby mode 2012-12-18 12:33:49.075 CET @ 1514 LOG: started streaming WAL from primary at 0/14000000 on timeline 1 2012-12-18 12:33:49.279 CET @ 1513 LOG: redo starts at 0/14000028 2012-12-18 12:33:49.282 CET @ 1513 LOG: consistent recovery state reached at 0/140000E8 2012-12-18 12:33:49.282 CET @ 1508 LOG: database system is ready to accept read only connections 2012-12-18 12:34:14.007 CET @ 1514 LOG: replication terminated by primary server 2012-12-18 12:34:14.007 CET @ 1514 DETAIL: End of WAL reached on timeline 1 2012-12-18 12:34:14.014 CET @ 1514 FATAL: could not send data to WAL stream: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. 2012-12-18 12:34:14.014 CET @ 1513 LOG: record with zero length at 0/15000088 2012-12-18 12:34:44.066 CET @ 1598 LOG: fetching timeline history file for timeline 2 from primary server 2012-12-18 12:34:44.071 CET @ 1598 LOG: started streaming WAL from primary at 0/15000000 on timeline 1 2012-12-18 12:34:44.071 CET @ 1598 LOG: replication terminated by primary server 2012-12-18 12:34:44.071 CET @ 1598 DETAIL: End of WAL reached on timeline 1 2012-12-18 12:34:44.071 CET @ 1598 FATAL: error reading result of streaming command: ERROR: requested WAL segment 000000010000000000000015 has already been removed 2012-12-18 12:34:44.072 CET @ 1513 LOG: new target timeline is 2 2012-12-18 12:34:44.078 CET @ 1612 LOG: started streaming WAL from primary at 0/15000000 on timeline 2 2012-12-18 12:34:44.078 CET @ 1612 LOG: replication terminated by primary server 2012-12-18 12:34:44.078 CET @ 1612 DETAIL: End of WAL reached on timeline 2 2012-12-18 12:34:44.078 CET @ 1612 FATAL: error reading result of streaming command: ERROR: requested WAL segment 000000020000000000000015 has already been removed ... and so on. Looks like it did switch timeline, but the new slave doesn't have some bits in WAL. Which is weird, because, on ubuntu2: $ select pg_current_xlog_location(); pg_current_xlog_location ────────────────────────── 0/150000E8 (1 row) On ubuntu2 in pg_xlog: 12:41:05 pgdba@ubuntu2 ~/data/pg_xlog =$ ls -la total 49168 drwx------ 3 pgdba pgdba 4096 Dec 18 12:34 ./ drwx------ 15 pgdba pgdba 4096 Dec 18 12:34 ../ -rw------- 1 pgdba pgdba 16777216 Dec 18 12:34 000000020000000000000016 -rw------- 1 pgdba pgdba 16777216 Dec 18 12:34 000000020000000000000017 -rw------- 1 pgdba pgdba 16777216 Dec 18 12:33 000000020000000000000018 -rw------- 1 pgdba pgdba 42 Dec 18 12:34 00000002.history drwx------ 2 pgdba pgdba 4096 Dec 18 12:34 archive_status/ 12:41:06 pgdba@ubuntu2 ~/data/pg_xlog =$ cat 00000002.history 1 0/15000088 no recovery target specified In pg_log on ubuntu2 I see: 2012-12-18 12:41:34.428 CET [unknown]@[unknown] 1685 LOG: connection received: host=172.28.173.142 port=45842 2012-12-18 12:41:34.430 CET replication@[unknown] 1685 172.28.173.142(45842) LOG: replication connection authorized: user=replication 2012-12-18 12:41:34.432 CET replication@[unknown] 1685 172.28.173.142(45842) ERROR: requested WAL segment 000000020000000000000015 has already been removed 2012-12-18 12:41:34.433 CET replication@[unknown] 1685 172.28.173.142(45842) LOG: disconnection: session time: 0:00:00.005 user=replication database= host=172.28.173.142 port=45842 Something looks weird. To put it lightly. Best regards, depesz -- The best thing about modern society is how easy it is to avoid contact with it. http://depesz.com/ -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general