On Nov 8, 2017 5:59 AM, "Douglas Reed" <douglas@xxxxxxxxxxx> wrote:
Hi

Sorry if this email was already received, but I sent it originally from my own email address and received no response from the moderator, so I assume that it may have got caught in the filter.

We are having a number of problems when we attempt to rebuild our slave from its master. We have made about three attempts without success (using a proven set of notes). It's been rebuilt several times over the last few months, although the time between pg_basebackup being keyed and it actually copying data can be up to six minutes.
Try setting the checkpoint mode to fast in the pg_basebackup command (-c fast), so the master performs an immediate checkpoint rather than a slow, spread-out one before the base backup begins.
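As a rough sketch, this is the command from the original post with the fast checkpoint option added. The host name is a placeholder (the real address was hidden in the post), and basebackup_cmd is only an illustrative helper that prints the command so it can be reviewed before it is run.

```shell
#!/bin/sh
# Build the pg_basebackup command line with a fast checkpoint requested.
# basebackup_cmd is a hypothetical helper; master.example.com is a placeholder.
basebackup_cmd() {
    host="$1"
    datadir="$2"
    # -c fast: ask the master for an immediate checkpoint instead of the
    # default "spread" checkpoint, which can take several minutes.
    echo "pg_basebackup -h $host -D $datadir -P -U postgres --xlog-method=stream -c fast"
}

# Print the command for review, then paste it into the standby's shell.
basebackup_cmd master.example.com /var/lib/pgsql/9.5/data
```

A fast checkpoint causes a short burst of I/O on the master, so it is a trade-off between start-up delay and load.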
And after completion the time taken from database startup to psql availability can also be several minutes while it processes any remaining logs.
Depending on how busy your primary is, this is expected. Approximately what is the WAL generation rate for your database?
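One way to estimate the WAL generation rate is to sample the current WAL location twice and divide the difference by the elapsed time. The sketch below assumes a 9.5 master, where the function is pg_current_xlog_location(). The helper names and the sample locations are made up for illustration.

```shell
#!/bin/sh
# Convert a WAL location such as "2A/4F000028" to an absolute byte position.
lsn_to_bytes() {
    lsn_hi="${1%%/*}"
    lsn_lo="${1##*/}"
    echo $(( (0x$lsn_hi << 32) + 0x$lsn_lo ))
}

# wal_rate START_LSN END_LSN SECONDS -> bytes of WAL generated per second
wal_rate() {
    echo $(( ( $(lsn_to_bytes "$2") - $(lsn_to_bytes "$1") ) / $3 ))
}

# On a live 9.5 master, sample the location twice, e.g. 60 seconds apart:
#   a=$(psql -At -c "SELECT pg_current_xlog_location()")
#   sleep 60
#   b=$(psql -At -c "SELECT pg_current_xlog_location()")
#   wal_rate "$a" "$b" 60
# Illustrative values only (64 MB of WAL in 60 seconds):
wal_rate "0/1000000" "0/5000000" 60
```

Sampling over a longer window during the busiest period gives a more useful figure than a single quiet minute.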
Both machines are virtual and are hosted with a leading cloud provider
Have you checked performance metrics like I/O, CPU load, etc.? Usually you will be able to view some basic metrics out of the box.
OS: Linux CentOS 6 (6.8 Final)
pg version: 9.5.4
Quite a few pg_basebackup bugs were fixed in the later minor versions, especially 9.5.6:
Fix pg_basebackup's rate limiting in the presence of slow I/O (Antonin Houska)
Fix possible pg_basebackup failure on standby server when including WAL files (Amit Kapila, Robert Haas)
I always recommend keeping the minor version up to date (9.5.9 is the latest), since it just needs a quick restart of the database. I won't be surprised if this alone fixes your issue.
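A small sketch for checking how far a server trails the latest minor release in its series. minor_releases_behind is a hypothetical helper, and it assumes the pre-10 numbering scheme (major.major.minor, e.g. 9.5.4).

```shell
#!/bin/sh
# minor_releases_behind CURRENT LATEST
# Prints how many minor releases CURRENT trails LATEST within the same
# pre-10 major series (e.g. 9.5.x). Hypothetical helper for illustration.
minor_releases_behind() {
    current="$1"
    latest="$2"
    if [ "${current%.*}" != "${latest%.*}" ]; then
        echo "different major versions: $current vs $latest" >&2
        return 2
    fi
    echo $(( ${latest##*.} - ${current##*.} ))
}

# On a live server, CURRENT could come from:
#   psql -At -c "SHOW server_version"
minor_releases_behind 9.5.4 9.5.9
```

The actual upgrade is done through the package manager and a restart of the service; the exact package and service names depend on how PostgreSQL was installed.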
pg WAL settings on the master database

max_wal_senders = 5
max_wal_size = 4GB
min_wal_size = 256MB
wal_block_size = 8192
wal_buffers = 1MB
wal_compression = off
wal_keep_segments = 32
wal_level = hot_standby
wal_log_hints = off
wal_receiver_status_interval = 10s
wal_receiver_timeout = 1min
wal_retrieve_retry_interval = 5s
wal_segment_size = 16MB
wal_sender_timeout = 1min
wal_sync_method = fdatasync
wal_writer_delay = 200ms

Message from pg_basebackup

[postgres@xxxxxxxxxx]$ pg_basebackup -h -IP_HIDDEN- -D /var/lib/pgsql/9.5/data -P -U postgres --xlog-method=stream
pg_basebackup: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
269061959/269164935 kB (99%), 1/1 tablespace
pg_basebackup: child process exited with error 1

Relevant error messages from master's log

Nov 7 11:52:32 o8-data1 postgres[28558]: [6-1] user=[unknown],db=[unknown],app=[unknown]client=-IP_ HIDDEN- LOG: connection received: host=-IP_HIDDEN- port=41498
Nov 7 11:52:32 o8-data1 postgres[28558]: [7-1] user=postgres,db=[unknown],app=[unknown]client=-IP_ HIDDEN- LOG: replication connection authorized: user=postgres
Nov 7 13:51:44 o8-data1 postgres[28558]: [8-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_ HIDDEN- LOG: could not send data to client: Broken pipe
Nov 7 13:51:44 o8-data1 postgres[28558]: [9-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_ HIDDEN- ERROR: base backup could not send data, aborting backup
Nov 7 13:51:44 o8-data1 postgres[28558]: [10-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_ HIDDEN- FATAL: connection to client lost
Nov 7 13:51:44 o8-data1 postgres[28558]: [11-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_ HIDDEN- LOG: disconnection: session time: 1:59:11.943 user=postgres database= host=-IP_HIDDEN- port=41498
Nov 7 13:54:48 o8-data1 postgres[35445]: [6-1] user=[unknown],db=[unknown],app=[unknown]client=-IP_ HIDDEN- LOG: connection received: host=-IP_HIDDEN- port=44040
Nov 7 13:54:48 o8-data1 postgres[35445]: [7-1] user=postgres,db=[unknown],app=[unknown]client=-IP_ HIDDEN- LOG: replication connection authorized: user=postgres
Nov 7 15:09:20 o8-data1 postgres[35445]: [8-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_ HIDDEN- LOG: could not send data to client: Broken pipe
Nov 7 15:09:20 o8-data1 postgres[35445]: [9-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_ HIDDEN- ERROR: base backup could not send data, aborting backup
Nov 7 15:09:20 o8-data1 postgres[35445]: [10-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_ HIDDEN- FATAL: connection to client lost
Nov 7 15:09:20 o8-data1 postgres[35445]: [11-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_ HIDDEN- LOG: disconnection: session time: 1:14:31.925 user=postgres database= host=-IP_HIDDEN- port=44040

Many thanks in advance
--
Douglas Reed
DBA
FSB Technology
Try increasing wal_sender_timeout before running pg_basebackup.
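A minimal sketch of raising the timeout without editing postgresql.conf by hand, assuming superuser access on the master. The helper name and the 10min value are illustrative, not a recommendation from this thread. In 9.5 wal_sender_timeout only needs a configuration reload, not a restart.

```shell
#!/bin/sh
# Print the SQL that raises wal_sender_timeout on the master and reloads
# the configuration. Helper name and timeout value are illustrative.
raise_sender_timeout_sql() {
    timeout="$1"
    echo "ALTER SYSTEM SET wal_sender_timeout = '$timeout';"
    echo "SELECT pg_reload_conf();"
}

# Review the statements, then pipe them to psql on the master as a superuser:
#   raise_sender_timeout_sql 10min | psql -U postgres
raise_sender_timeout_sql 10min
```

Once the rebuild has finished, ALTER SYSTEM RESET wal_sender_timeout followed by another reload restores the previous setting.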
Also, if you are sending/storing WAL files anywhere besides the master, once your pg_basebackup command fails, try copying those missing files manually to the path given in the restore_command parameter in the secondary's recovery.conf.
A --slot option was added to pg_basebackup in 9.6, so a command using -X stream can connect to the replication slot used by the secondary on the master, making sure no WAL files go missing.
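For reference, a sketch of what that could look like after an upgrade to 9.6. It does not apply to 9.5. The slot name, host name and helper function are placeholders, and the slot has to exist on the master before the backup starts.

```shell
#!/bin/sh
# Sketch for 9.6 only. slot_basebackup_cmd, rebuild_slot and
# master.example.com are placeholders for illustration.
slot_basebackup_cmd() {
    host="$1"
    datadir="$2"
    slot="$3"
    # -S/--slot can only be used together with -X stream. Create the slot
    # on the master first, for example:
    #   SELECT pg_create_physical_replication_slot('rebuild_slot');
    echo "pg_basebackup -h $host -D $datadir -P -U postgres -X stream -S $slot"
}

# Print the command for review before running it on the standby.
slot_basebackup_cmd master.example.com /var/lib/pgsql/9.6/data rebuild_slot
```

A slot makes the master retain WAL until it has been consumed, so an abandoned slot should be dropped to avoid filling the master's disk.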