Streaming replication: replicant server not starting (9.4.4, Win 2008)

Tim Bowden <tim.bowden@xxxxxxxxxxxxxxx> · Wed, 31 Jan 2018 02:16:27 +0800

I've inherited a PG 9.4.4 install on Win 2008 that I'm wanting to
stream from (abt 80Gb on disk).  Everything seems to be working from
the "master" side.

I have an AMI of the master host (a vmware instance in our soon to
close data centre) that has been spun up in AWS (replicant).  In it's
"default" configuration (ie, exactly the same as master) it worked
fine.

I ran pg_basebackup on the master (from memory- details on work pc):
pg_basebackup -D "e:\\mybasebackup\\" -F t -R -X f -z -c fast \
 -h <ip_addr> -U <rep_user>

Base backup is created fine.  I've copied it up to AWS and extracted it
into the correct location (after ensuring it is empty) on the replicant
windows host (tar -xzvf base.tgz from a linux box with the pg data
drive mounted).  File metadata is changed during the extraction (ie,
file timestamps- could this possibly be an issue?) but permissions on
the win2008 replicant host have been set & double checked.

When I try to start the pg server on replicant, the log streaming from
master works (and keeps working till the server is rebooted), but the
server can't start for some reason.

Copy of pg log file on replicant:

2018-01-30 17:18:02 AWST LOG:  database system was shut down in
recovery at 2018-01-30 17:16:59 AWST
2018-01-30 17:18:02 AWST LOG:  entering standby mode
2018-01-30 17:18:02 AWST LOG:  redo starts at C26/84018A30
2018-01-30 17:18:02 AWST LOG:  consistent recovery state reached at
C26/8401C6D0
2018-01-30 17:18:02 AWST LOG:  invalid record length at C26/8401C6D0
2018-01-30 17:18:03 AWST LOG:  started streaming WAL from primary at
C26/84000000 on timeline 1
2018-01-30 17:18:03 AWST FATAL:  the database system is starting up
2018-01-30 17:18:04 AWST FATAL:  the database system is starting up
2018-01-30 17:18:05 AWST FATAL:  the database system is starting up
2018-01-30 17:18:06 AWST FATAL:  the database system is starting up
2018-01-30 17:18:07 AWST FATAL:  the database system is starting up
2018-01-30 17:18:08 AWST FATAL:  the database system is starting up

The last line is repeated for about a minute till it gives up.  It
leaves behind a bunch of processes that keep streaming log files as
master creates new wal records.  This keeps going till the replicant
box is rebooted (easiest way to clean up the processes and free up port
5432).

I can keep rebooting the replicant host and on retrying to start the pg
server, it will apply any new wal files that have been streamed, then
fail to start the server but keep streaming again.

The Windows service start command for PG: 

"C:\Program Files (x86)\PostgreSQL\9.4.4\bin\pg_ctl.exe" runservice -N
"postgresql-9.4" -D "E:\PostgreSQL\9.4.4\data" -w

Config files:

Master postgresql.conf:

dynamic_shared_memory_type = windows
wal_level = hot_standby	# Was "archive", but we'd like to do
hot...
archive_mode = on
archive_command = 'copy "%p"
"e:\\PostgreSQL\\9.4.4\\wal_archive\\%f"'  # Windows
max_wal_senders = 8
wal_keep_segments = 900 # Sick of losing old segments while I fix this
max_replication_slots = 8
log_line_prefix = '%t '
log_timezone = 'Australia/Perth'
datestyle = 'iso, dmy'
timezone = 'Australia/Perth'
lc_messages = 'English_Australia.1252'
lc_monetary = 'English_Australia.1252'
lc_numeric = 'English_Australia.1252'
lc_time = 'English_Australia.1252'
default_text_search_config = 'pg_catalog.english'
listen_addresses = '*'
port = 5432
max_connections = 300
shared_buffers = 500MB
work_mem = 32MB
maintenance_work_mem = 128MB
checkpoint_segments = 100
random_page_cost = 2.0
effective_cache_size = 1500MB
log_destination = 'stderr'
logging_collector = on
log_min_duration_statement = 500
log_line_prefix = '%t '

Replicant postgresql.conf (essentially a copy from master with minimal
changes):

dynamic_shared_memory_type = windows
wal_level = archive
max_standby_streaming_delay = 30s
wal_receiver_status_interval = 30s
log_line_prefix = '%t '
log_timezone = 'Australia/Perth'
datestyle = 'iso, dmy'
timezone = 'Australia/Perth'
lc_messages = 'English_Australia.1252'
lc_monetary = 'English_Australia.1252'
lc_numeric = 'English_Australia.1252'
lc_time = 'English_Australia.1252'
default_text_search_config = 'pg_catalog.english'
listen_addresses = '*'
port = 5432
max_connections = 300
shared_buffers = 500MB
work_mem = 32MB
maintenance_work_mem = 128MB
checkpoint_segments = 100
random_page_cost = 2.0
effective_cache_size = 1500MB
log_destination = 'stderr'
logging_collector = on
log_min_duration_statement = 500
log_line_prefix = '%t '

replicant recovery.conf:

standby_mode = 'on'
primary_conninfo = 'host=<master_ip> user=<rep_user>
password=<don't_tell> connect_timeout=10 application_name=replicant_3'
primary_slot_name = 'replicant_3'
recovery_min_apply_delay = 10
#restore_command = 'copy
"e:\\PostgreSQL\\9.4.4\\basebackup\\wal_archive\\%f" "%p" ' 
trigger_file = 'recovery_done.txt'

The restore_command was uncommented at first and worked fine. All new
updates now are provided by streaming.

Any ideas?  I'm tearing my hair out with this. The boxes are about as
identical as you can get, apart from the fact that one is running on
VMware and the other in AWS.

Thanks,
Tim Bowden