Re: During promotion, new master tries to archive same segment twice

"Phil Endecott" <spam_from_pgsql_lists@xxxxxxxxxxxx> · Thu, 16 Aug 2018 09:48:15 +0100

Adrian Klaver wrote:
On 08/15/2018 01:25 PM, Phil Endecott wrote:
Dear Experts,

Here is my latest issue with replication:

I have 3 systems, X, Y and Z.  Initially X is replicated to Y
and Y is replicated to Z; in each case the replication involves
log-shipping using archive_command and restore_command (via a
4th system called "backup") and then streaming.

The plan is to shut down X and to promote Y to be the new master.

I shut down X, and as expected Y complains that it con no longer
connect to it:

2018-08-15 15:10:58.785 UTC [617] LOG:  replication terminated by primary server
2018-08-15 15:10:58.785 UTC [617] DETAIL:  End of WAL reached on timeline 1 at 7/E9000098.
2018-08-15 15:10:58.785 UTC [617] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory
2018-08-15 15:10:59.928 UTC [354] LOG:  invalid resource manager ID 22 at 7/E9000098
2018-08-15 15:11:00.092 UTC [5856] FATAL:  could not connect to the primary server: FATAL:  the database system is shutting down

Note that in the middle of that, after streaming has failed it
tries to fetch the next WAL segment, E9, from the backup; this
fails because this segment doesn't exist.

Then I pg_ctl promote on Y:

2018-08-15 15:12:11.888 UTC [354] LOG:  received promote request
2018-08-15 15:12:11.888 UTC [354] LOG:  redo done at 7/E9000028
2018-08-15 15:12:11.888 UTC [354] LOG:  last completed transaction was at log time 2018-08-15 14:45:05.961153+00
scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory
scp: backup/postgresql/archivedir/00000002.history: No such file or directory
2018-08-15 15:12:13.316 UTC [354] LOG:  selected new timeline ID: 2
2018-08-15 15:12:13.368 UTC [354] FATAL:  could not open file "recovery.conf": Permission denied

So in the process of exiting recovery mode it tries to rename
recovery.conf to recovery.done, but it fails to do so because
my recovery.conf was owned by root, not user postgres.  Oooops.
Perhaps it would be appropriate to check for that when it
initially reads recovery.conf?  Anyway, I fix the permissions
and restart it.  It starts in recovery mode and cannot connect
to the master, as expected.

2018-08-15 15:16:58.061 UTC [6036] LOG:  restored log file "0000000100000007000000E8" from archive
2018-08-15 15:16:58.108 UTC [6036] LOG:  redo starts at 7/E8678300
2018-08-15 15:16:58.731 UTC [6087] LOG:  archive command failed with exit code 1
2018-08-15 15:16:58.731 UTC [6087] DETAIL:  The failed archive command was: ssh backup test ! -f backup/postgresql/Y/0000000100000007000000E8 && scp pg_xlog/0000000100000007000000E8 backup:backup/postgresql/Y/0000000100000007000000E8
scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory

The above is not clear to me. My best guess:

1) The ssh backup ... is archiving WAL's from Y to Y specific archive on 
'backup'?

Yes, there are two archive directories on "backup", one called
"archivedir" which is for the X-to-Y replication and a second
called "Y" which is for the Y-to-Z replication.

2) The scp: backup ... is the restore_command fetching archived WAL's 
that originated on X and where sent to backup/postgresql/archivedir/ ?

Yes.

3) If 2) is true why is it showing up in the error for the archive command?

It's not part of the error for the archive command; it's just the
next thing in the log file.  Y is (1) trying write WAL segment
E8 to the Y-to-Z archive and failing because it is already there and
(2) trying to read WAL segment E9 from the X-to-Y archive and failing
because it doesn't exist (and never will).  (1) is the problem, (2) is
expected.

Can you show us your actual archive and restore commands?

On Y:

restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'
archive_command = 'ssh backup test ! -f backup/postgresql/Y/%f && scp 
%p backup:backup/postgresql/Y/%f'

2018-08-15 15:16:58.823 UTC [6036] LOG:  consistent recovery state reached at 7/E9000098
2018-08-15 15:16:58.823 UTC [6036] LOG:  invalid resource manager ID 22 at 7/E9000098
2018-08-15 15:16:58.824 UTC [6035] LOG:  database system is ready to accept read only connections
2018-08-15 15:16:58.855 UTC [6094] FATAL:  could not connect to the primary server: could not connect to server: Connection refused

I can now connect to the db and do read/write operations, but it
continues to try and fail to re-write segment E8 to its archive.

Eventually I manually delete E8 from the archive; it then
succeeds at writing it with the same content as before and seems to
continue to function normally.

Which archive?

The Y-to-Z archive, backup:backup/postgresql/Y/...E8.

Questions:

- Can all of this be blamed on the failure of the first attempt
to promote due to the wrong permissions on recovery.conf?

- Should my archive_command detect the case where it is asked to
write the same file again with the same contents, and report
success in that case?

- Is this a bug?

Regards, Phil.