Re: Problems rebuilding slave using pg_basebackup

On Nov 8, 2017 5:59 AM, "Douglas Reed" <douglas@xxxxxxxxxxx> wrote:

Hi

Sorry if this email was already received, but I originally sent it from my own email address
and received no response from the moderator, so I assume it may have been caught in the
filter.

We are having a number of problems when we attempt to rebuild our slave from its master.

We have made about three attempts without success (using a proven set of notes).

It's been rebuilt several times over the last few months although the time between 
pg_basebackup being keyed and it actually copying data can be up to six minutes. 

Try setting the checkpoint mode to fast in the pg_basebackup command (-c fast) so it requests an immediate checkpoint instead of waiting passively for one before beginning the base backup.
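As a rough sketch, this is the invocation from the original report with only -c fast added. PRIMARY_HOST is a placeholder for the hidden master address, and the PG_BASEBACKUP variable and the rebuild_standby function name are hypothetical helpers added here so the arguments can be printed without touching a server.

```shell
# Placeholder: the real master address was hidden in the original post.
PRIMARY_HOST="${PRIMARY_HOST:-primary.example.invalid}"
# Hypothetical override: set PG_BASEBACKUP=echo to print the arguments only.
PG_BASEBACKUP="${PG_BASEBACKUP:-pg_basebackup}"

# Same options as the original command; "-c fast" asks the master for an
# immediate checkpoint instead of waiting for the next scheduled one.
rebuild_standby() {
  "$PG_BASEBACKUP" \
    -h "$PRIMARY_HOST" \
    -D /var/lib/pgsql/9.5/data \
    -U postgres \
    -P \
    --xlog-method=stream \
    -c fast
}
```

Calling rebuild_standby on the standby host runs the backup; with PG_BASEBACKUP=echo it only prints the argument list.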

And after completion the time taken from database startup to psql availability 
can also be several minutes while it processes any remaining logs.

Depending on how busy your primary is, this can be expected. Roughly what is the WAL generation rate for your database?
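One way to estimate the WAL generation rate is to sample the current WAL position on the primary twice and divide the difference by the elapsed time. The helpers below are an illustrative sketch (the function names are made up, and the LSNs in the example are invented, not taken from this system); on 9.5 the position comes from pg_current_xlog_location().

```shell
# Convert an LSN such as 16/B374D848 into an absolute byte position.
# The part before the slash counts 4 GiB units, the part after is an offset.
lsn_to_bytes() {
  hi=${1%%/*}
  lo=${1##*/}
  echo $(( 0x$hi * 4294967296 + 0x$lo ))
}

# Average WAL bytes per second between two LSN samples taken $3 seconds apart.
wal_rate() {
  start=$(lsn_to_bytes "$1")
  end=$(lsn_to_bytes "$2")
  echo $(( (end - start) / $3 ))
}

# Sampling on the primary (9.5 function name), run twice some seconds apart:
#   psql -U postgres -Atc "SELECT pg_current_xlog_location()"
# Example with invented samples taken 60 seconds apart:
wal_rate 0/1000000 0/5000000 60
```

The same difference can also be computed server-side with pg_xlog_location_diff(); the shell version just makes the arithmetic explicit.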


Both machines are virtual and are hosted with a leading cloud provider

Have you checked performance metrics like IO, CPU load, etc.? Usually you will be able to view some basic metrics out of the box.


OS Linux Centos6 (6.8 Final)

pg version 9.5.4

Quite a few pg_basebackup bugs were fixed in the later minor versions, especially 9.5.6:

Fix pg_basebackup's rate limiting in the presence of slow I/O (Antonin Houska)

Fix possible pg_basebackup failure on standby server when including WAL files (Amit Kapila, Robert Haas)

https://www.postgresql.org/docs/9.5/static/release-9-5-6.html

I always recommend keeping the minor version up to date (9.5.9 is the latest), since a minor upgrade just needs a quick restart of the database. I won't be surprised if this alone fixes your issue.


pg WAL settings on the master database

     max_wal_senders                = 5            
     max_wal_size                   = 4GB          
     min_wal_size                   = 256MB        
     wal_block_size                 = 8192         
     wal_buffers                    = 1MB          
     wal_compression                = off          
     wal_keep_segments              = 32           
     wal_level                      = hot_standby  
     wal_log_hints                  = off          
     wal_receiver_status_interval   = 10s          
     wal_receiver_timeout           = 1min         
     wal_retrieve_retry_interval    = 5s           
     wal_segment_size               = 16MB         
     wal_sender_timeout             = 1min         
     wal_sync_method                = fdatasync    
     wal_writer_delay               = 200ms        


Message from pg_basebackup

    [postgres@xxxxxxxxxx]$ pg_basebackup -h -IP_HIDDEN- -D /var/lib/pgsql/9.5/data -P -U postgres --xlog-method=stream
    pg_basebackup: could not receive data from WAL stream: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
    269061959/269164935 kB (99%), 1/1 tablespace
    pg_basebackup: child process exited with error 1


Relevant error messages from master's log

    Nov  7 11:52:32 o8-data1 postgres[28558]: [6-1] user=[unknown],db=[unknown],app=[unknown]client=-IP_HIDDEN- LOG:  connection received: host=-IP_HIDDEN- port=41498
    Nov  7 11:52:32 o8-data1 postgres[28558]: [7-1] user=postgres,db=[unknown],app=[unknown]client=-IP_HIDDEN- LOG:  replication connection authorized: user=postgres
    Nov  7 13:51:44 o8-data1 postgres[28558]: [8-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- LOG:  could not send data to client: Broken pipe
    Nov  7 13:51:44 o8-data1 postgres[28558]: [9-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- ERROR:  base backup could not send data, aborting backup
    Nov  7 13:51:44 o8-data1 postgres[28558]: [10-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- FATAL:  connection to client lost
    Nov  7 13:51:44 o8-data1 postgres[28558]: [11-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- LOG:  disconnection: session time: 1:59:11.943 user=postgres database= host=-IP_HIDDEN- port=41498

    Nov  7 13:54:48 o8-data1 postgres[35445]: [6-1] user=[unknown],db=[unknown],app=[unknown]client=-IP_HIDDEN- LOG:  connection received: host=-IP_HIDDEN- port=44040
    Nov  7 13:54:48 o8-data1 postgres[35445]: [7-1] user=postgres,db=[unknown],app=[unknown]client=-IP_HIDDEN- LOG:  replication connection authorized: user=postgres
    Nov  7 15:09:20 o8-data1 postgres[35445]: [8-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- LOG:  could not send data to client: Broken pipe
    Nov  7 15:09:20 o8-data1 postgres[35445]: [9-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- ERROR:  base backup could not send data, aborting backup
    Nov  7 15:09:20 o8-data1 postgres[35445]: [10-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- FATAL:  connection to client lost
    Nov  7 15:09:20 o8-data1 postgres[35445]: [11-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- LOG:  disconnection: session time: 1:14:31.925 user=postgres database= host=-IP_HIDDEN- port=44040

Many thanks in advance



--
Douglas Reed
DBA
FSB Technology



What are your archive_command and full_page_writes set to? Also, what is the value of checkpoint_timeout? (checkpoint_segments no longer exists in 9.5; it was replaced by max_wal_size, which you listed as 4GB.)

Try increasing wal_sender_timeout before running pg_basebackup. 
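A minimal sketch of raising the timeout on the master without a restart, assuming superuser access through psql and a database named postgres. The 5min value is only an illustration, not a recommendation from this thread, and the PSQL variable and function name are hypothetical helpers. ALTER SYSTEM is available from 9.4 onward, and wal_sender_timeout takes effect on a configuration reload.

```shell
# Hypothetical override: set PSQL=echo to print the statements instead of running them.
PSQL="${PSQL:-psql}"

# Persist a new wal_sender_timeout on the master, then reload the configuration.
# $1 is the new value, e.g. 5min (illustrative only).
set_wal_sender_timeout() {
  "$PSQL" -U postgres -d postgres -c "ALTER SYSTEM SET wal_sender_timeout = '$1'" \
    && "$PSQL" -U postgres -d postgres -c "SELECT pg_reload_conf()"
}
```

After the backup completes, the same function can put the setting back to the 1min value shown in the settings list above.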

Also, if you are sending/storing WAL files anywhere besides the master, then once your pg_basebackup command fails, try manually copying the missing files to the directory that the restore_command in the secondary's recovery.conf reads from.

A --slot option was added to pg_basebackup in 9.6 so that a command using -X stream can connect to the replication slot used by the secondary on the master, making sure no WAL files go missing.
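For reference only, since it needs 9.6 or later, a sketch of what the slot-based invocation could look like. The slot name, host, and target directory are all placeholders, and the PG_BASEBACKUP override and function name are hypothetical helpers. In 9.6 the slot has to exist on the master before pg_basebackup is started.

```shell
# Placeholders: none of these values come from the original post.
PRIMARY_HOST="${PRIMARY_HOST:-primary.example.invalid}"
SLOT_NAME="${SLOT_NAME:-standby1_slot}"
TARGET_DIR="${TARGET_DIR:-/var/lib/pgsql/9.6/data}"
# Hypothetical override: set PG_BASEBACKUP=echo to print the arguments only.
PG_BASEBACKUP="${PG_BASEBACKUP:-pg_basebackup}"

# Create the slot on the master first, for example:
#   psql -U postgres -c "SELECT pg_create_physical_replication_slot('standby1_slot')"
# Then stream WAL through that slot while the base backup runs (9.6+ only).
rebuild_standby_with_slot() {
  "$PG_BASEBACKUP" -h "$PRIMARY_HOST" -D "$TARGET_DIR" -U postgres -P \
    -X stream -S "$SLOT_NAME"
}
```

Because the master retains WAL for the slot until it is consumed, an unused slot should be dropped so WAL does not pile up on the master.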
