Dear Experts, Here is my latest issue with replication: I have 3 systems, X, Y and Z. Initially X is replicated to Y and Y is replicated to Z; in each case the replication involves log-shipping using archive_command and restore_command (via a 4th system called "backup") and then streaming. The plan is to shut down X and to promote Y to be the new master. I shut down X, and as expected Y complains that it con no longer connect to it: 2018-08-15 15:10:58.785 UTC [617] LOG: replication terminated by primary server 2018-08-15 15:10:58.785 UTC [617] DETAIL: End of WAL reached on timeline 1 at 7/E9000098. 2018-08-15 15:10:58.785 UTC [617] FATAL: could not send end-of-streaming message to primary: no COPY in progress scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory 2018-08-15 15:10:59.928 UTC [354] LOG: invalid resource manager ID 22 at 7/E9000098 2018-08-15 15:11:00.092 UTC [5856] FATAL: could not connect to the primary server: FATAL: the database system is shutting down Note that in the middle of that, after streaming has failed it tries to fetch the next WAL segment, E9, from the backup; this fails because this segment doesn't exist. Then I pg_ctl promote on Y: 2018-08-15 15:12:11.888 UTC [354] LOG: received promote request 2018-08-15 15:12:11.888 UTC [354] LOG: redo done at 7/E9000028 2018-08-15 15:12:11.888 UTC [354] LOG: last completed transaction was at log time 2018-08-15 14:45:05.961153+00 scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory scp: backup/postgresql/archivedir/00000002.history: No such file or directory 2018-08-15 15:12:13.316 UTC [354] LOG: selected new timeline ID: 2 2018-08-15 15:12:13.368 UTC [354] FATAL: could not open file "recovery.conf": Permission denied So in the process of exiting recovery mode it tries to rename recovery.conf to recovery.done, but it fails to do so because my recovery.conf was owned by root, not user postgres. Oooops. Perhaps it would be appropriate to check for that when it initially reads recovery.conf? Anyway, I fix the permissions and restart it. It starts in recovery mode and cannot connect to the master, as expected. 2018-08-15 15:16:58.061 UTC [6036] LOG: restored log file "0000000100000007000000E8" from archive 2018-08-15 15:16:58.108 UTC [6036] LOG: redo starts at 7/E8678300 2018-08-15 15:16:58.731 UTC [6087] LOG: archive command failed with exit code 1 2018-08-15 15:16:58.731 UTC [6087] DETAIL: The failed archive command was: ssh backup test ! -f backup/postgresql/Y/0000000100000007000000E8 && scp pg_xlog/0000000100000007000000E8 backup:backup/postgresql/Y/0000000100000007000000E8 scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory 2018-08-15 15:16:58.823 UTC [6036] LOG: consistent recovery state reached at 7/E9000098 2018-08-15 15:16:58.823 UTC [6036] LOG: invalid resource manager ID 22 at 7/E9000098 2018-08-15 15:16:58.824 UTC [6035] LOG: database system is ready to accept read only connections 2018-08-15 15:16:58.855 UTC [6094] FATAL: could not connect to the primary server: could not connect to server: Connection refused But note that in the middle of that log fragment it has tried to write WAL segment E8 to its (output) WAL archive, which is used to replicate to the downstream system Z, and it has found that that segment already exists there. The file that's already there has the same checksum as the E8 file in the pg_xlog directory. Somehow it has forgotten that it has already archived that segment. It keeps retrying this archive command. I try again to promote it: 2018-08-15 15:17:14.424 UTC [6036] LOG: received promote request 2018-08-15 15:17:14.424 UTC [6036] LOG: redo done at 7/E9000028 scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory scp: backup/postgresql/archivedir/00000002.history: No such file or directory 2018-08-15 15:17:15.832 UTC [6036] LOG: selected new timeline ID: 2 2018-08-15 15:17:15.884 UTC [6036] LOG: archive recovery complete scp: backup/postgresql/archivedir/00000001.history: No such file or directory 2018-08-15 15:17:16.612 UTC [6036] LOG: MultiXact member wraparound protections are now enabled 2018-08-15 15:17:16.614 UTC [6035] LOG: database system is ready to accept connections I can now connect to the db and do read/write operations, but it continues to try and fail to re-write segment E8 to its archive. Eventually I manually delete E8 from the archive; it then succeeds at writing it with the same content as before and seems to continue to function normally. Questions: - Can all of this be blamed on the failure of the first attempt to promote due to the wrong permissions on recovery.conf? - Should my archive_command detect the case where it is asked to write the same file again with the same contents, and report success in that case? - Is this a bug? Thanks, Phil.