That fixed it, thanks Scott. I wish there had been more detail about that parameter on the PostgreSQL website (maybe I could add it). I commented out that parameter and successfully restarted the server. This time it remained a standby.
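For reference, the standby's recovery.conf now looks roughly like this (the same settings quoted further down the thread, with just that one parameter commented out):

standby_mode = on
primary_conninfo = 'host=10.32.10.175 port=5432 user=replicator'
primary_slot_name = 'replicant'
trigger_file = '/var/lib/pgsql/9.4/share/trigger_file'
restore_command = 'cp /var/lib/pgsql/9.4/share/%f %p'
archive_cleanup_command = '/usr/pgsql-9.4/bin/pg_archivecleanup /var/lib/pgsql/9.4/share %r'
#recovery_target = 'immediate'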
Thanks again,
Jay
On Thu, Oct 8, 2015 at 10:04 AM, Scott Mead <scottm@xxxxxxxxxxx> wrote:
On Thu, Oct 8, 2015 at 10:59 AM, John Scalia <jayknowsunix@xxxxxxxxx> wrote:

Yep, they're both in different directories and on different ports. Allow me to note that I have both hand-built this file from the sample provided in /usr/pgsql-9.4/share and used pg_basebackup's -R option to generate it automatically. It doesn't seem to make a difference. Here's the recovery.conf file I'm using on the standby:

restore_command = 'cp /var/lib/pgsql/9.4/share/%f %p'          # e.g. 'cp /mnt/server/archivedir/%f %p'
archive_cleanup_command = '/usr/pgsql-9.4/bin/pg_archivecleanup /var/lib/pgsql/9.4/share %r'
#recovery_end_command = ''
#recovery_target_name = ''      # e.g. 'daily backup 2011-01-26'
#recovery_target_time = ''      # e.g. '2004-07-14 22:39:00 EST'
#recovery_target_xid = ''
#recovery_target_inclusive = true
recovery_target = 'immediate'

I believe you should omit recovery_target. From http://www.postgresql.org/docs/9.4/static/recovery-target-settings.html:

recovery_target = 'immediate'
    This parameter specifies that recovery should end as soon as a consistent state is reached, i.e. as early as possible. When restoring from an online backup, this means the point where taking the backup ended.
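Once that parameter is removed, a quick way to confirm the server is actually staying in recovery is something like the following (5433 is only an example; use whatever port your standby listens on):

psql -p 5433 -c "SELECT pg_is_in_recovery();"

It should keep returning 't' on the standby and 'f' on the primary.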
--
Scott Mead
Sr. Architect
OpenSCG
PostgreSQL, Java & Linux Experts

pause_at_recovery_target = false
standby_mode = on
primary_conninfo = 'host=10.32.10.175 port=5432 user=replicator'        # e.g. 'host=localhost port=5432'
primary_slot_name = 'replicant'
trigger_file = '/var/lib/pgsql/9.4/share/trigger_file'
#recovery_min_apply_delay = 0

I have also gone through the master's postgresql.conf file several times looking for problems, and I've verified the replication slot exists.

Thanks for your assistance,
Jay

On Thu, Oct 8, 2015 at 9:51 AM, Keith Fiske <keith@xxxxxxxxxx> wrote:

On Thu, Oct 8, 2015 at 9:24 AM, John Scalia <jayknowsunix@xxxxxxxxx> wrote:

Hi all,

We have an ongoing discussion here about retiring Slony and implementing streaming synchronous replication. As part of this effort, I've been trying to build a small cluster with one primary and one standby. Yes, I know about the potential problems with a single standby, but this is only to provide configuration parameters to my sysadmins.

What I've tried is building the two servers on a single sandbox machine. The primary has been running correctly for some time. The standby starts, but quickly turns itself into a primary without any trigger file being present. My process is (a rough command sketch follows the list):

1) build the new standby in a different directory with pg_basebackup
2) edit the standby's postgresql.conf so the standby has a different port
3) start the standby
4) immediately after starting the standby, I see a recovery.done file in that PGDATA dir
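Roughly, the first three steps amount to something like this (the standby's data directory and port below are placeholders; the host and user come from the primary_conninfo above):

# 1) take a base backup and have it write a recovery.conf (-R)
/usr/pgsql-9.4/bin/pg_basebackup -h 10.32.10.175 -p 5432 -U replicator \
    -D /var/lib/pgsql/9.4/standby -R -X stream -P

# 2) give the standby its own port (a later setting overrides an earlier one)
echo "port = 5433" >> /var/lib/pgsql/9.4/standby/postgresql.conf

# 3) start the standby
/usr/pgsql-9.4/bin/pg_ctl -D /var/lib/pgsql/9.4/standby start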
Here's part of the debug2-level log file from that standby, but I'm not seeing why this standby is becoming a primary, only that it is:

[2015-10-08 07:18:19.448 CDT] @: LOG: 00000: database system is ready to accept read only connections
[2015-10-08 07:18:19.448 CDT] @: LOCATION: sigusr1_handler, postmaster.c:4857
cp: cannot stat `/var/lib/pgsql/9.4/share/00000002.history': No such file or directory
[2015-10-08 07:18:19.477 CDT] @: DEBUG: 00000: could not restore file "00000002.history" from archive: child process exited with exit code 1
[2015-10-08 07:18:19.477 CDT] @: LOCATION: RestoreArchivedFile, xlogarchive.c:304
[2015-10-08 07:18:19.477 CDT] @: LOG: 00000: selected new timeline ID: 2
[2015-10-08 07:18:19.477 CDT] @: LOCATION: StartupXLOG, xlog.c:7107
cp: cannot stat `/var/lib/pgsql/9.4/share/00000001.history': No such file or directory
[2015-10-08 07:18:19.485 CDT] @: DEBUG: 00000: could not restore file "00000001.history" from archive: child process exited with exit code 1
[2015-10-08 07:18:19.485 CDT] @: LOCATION: RestoreArchivedFile, xlogarchive.c:304
[2015-10-08 07:18:19.792 CDT] @: LOG: 00000: archive recovery complete
[2015-10-08 07:18:19.792 CDT] @: LOCATION: exitArchiveRecovery, xlog.c:5417
[2015-10-08 07:18:19.798 CDT] @: DEBUG: 00000: performing replication slot checkpoint
[2015-10-08 07:18:19.798 CDT] @: LOCATION: CheckPointReplicationSlots, slot.c:794
[2015-10-08 07:18:19.814 CDT] @: DEBUG: 00000: attempting to remove WAL segments older than log file 0000000000000006000000D8
[2015-10-08 07:18:19.814 CDT] @: LOCATION: RemoveOldXlogFiles, xlog.c:3775
[2015-10-08 07:18:19.814 CDT] @: DEBUG: 00000: SlruScanDirectory invoking callback on pg_multixact/offsets/0000
[2015-10-08 07:18:19.814 CDT] @: LOCATION: SlruScanDirectory, slru.c:1307
[2015-10-08 07:18:19.815 CDT] @: DEBUG: 00000: SlruScanDirectory invoking callback on pg_multixact/members/0000
[2015-10-08 07:18:19.815 CDT] @: LOCATION: SlruScanDirectory, slru.c:1307
[2015-10-08 07:18:19.815 CDT] @: DEBUG: 00000: SlruScanDirectory invoking callback on pg_multixact/offsets/0000
[2015-10-08 07:18:19.815 CDT] @: LOCATION: SlruScanDirectory, slru.c:1307
[2015-10-08 07:18:19.893 CDT] @: DEBUG: 00000: attempting to remove WAL segments newer than log file 0000000200000006000000E1
[2015-10-08 07:18:19.893 CDT] @: LOCATION: RemoveNonParentXlogFiles, xlog.c:5458
[2015-10-08 07:18:19.899 CDT] @: DEBUG: 00000: oldest MultiXactId member is at offset 1
[2015-10-08 07:18:19.899 CDT] @: LOCATION: SetOffsetVacuumLimit, multixact.c:2677
[2015-10-08 07:18:19.899 CDT] @: LOG: 00000: MultiXact member wraparound protections are now enabled
[2015-10-08 07:18:19.899 CDT] @: LOCATION: DetermineSafeOldestOffset, multixact.c:2587
[2015-10-08 07:18:19.899 CDT] @: DEBUG: 00000: MultiXact member stop limit is now 4294914944 based on MultiXact 1
[2015-10-08 07:18:19.899 CDT] @: LOCATION: DetermineSafeOldestOffset, multixact.c:2590
[2015-10-08 07:18:19.899 CDT] @: DEBUG: 00000: release all standby locks
[2015-10-08 07:18:19.899 CDT] @: LOCATION: StandbyReleaseAllLocks, standby.c:666
[2015-10-08 07:18:19.903 CDT] @: LOG: 00000: starting background worker process "powa"
[2015-10-08 07:18:19.904 CDT] @: LOCATION: do_start_bgworker, postmaster.c:5412
[2015-10-08 07:18:19.905 CDT] @: LOG: 00000: database system is ready to accept connections

The one thing that's a little weird about this configuration is that both the primary and the standby are on the same system, and I've never done a configuration this way before. Usually these are on different systems, but this was done just to test some configuration parameters. Can a standby and a primary not exist together on the same system? In any event, there is no trigger file as specified in the recovery.conf on this system. So, why is promotion occurring?
--
Jay

As long as they're on different ports and using different data directories, you should be fine.
Can you share the contents of your recovery.conf file?
--
Keith Fiske
Database Administrator
OmniTI Computer Consulting, Inc.
http://www.keithf4.com