Hello,
I tried to run pg_basebackup. Return value is 1.
How to find out its reason?
(I suspect that some wal after backup is missing- but how to find out the real reason? How to fix it?)
thanks, Andrej
--------------details:
environment: CentOS 6.7, postgres 9.5.1
( PostgreSQL 9.5.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-16), 64-bit)
I tried 2 forms of pg_basebackup (-X fetch and -X stream). Both were issued from a script:
# su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP} -D ${GEO_STDBY_DATA} -U pgreplic -P -v -X fetch" 2>${LOG_FILE}.stderr >> ${LOG_FILE}
# echo $?
1 <--------------pg_basebackup failed!
# cat log.stderr
# cat /var/log/cluster/geo_repair.log.err
transaction log start point: 0/E3000028 on timeline 1
WARNING: skipping special file "./pg_hba.conf"
WARNING: skipping special file "./pg_hba.conf.save"
transaction log end point: 0/E30000F8
pg_basebackup: base backup completed <------------------no reason for pg_basebackup failure!
# cp /tmp/pg_hba.conf /tmp/postgresql.conf /pg_data/
# su - postgres -c "/usr/pgsql-9.5/bin/pg_ctl -D /pg_data/ start"
# tail /pg_data/pg_log/postgresql-Fri.log
`pg_xlog/0000000100000000000000E2' -> `../backups/arc/0000000100000000000000E2'
2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING: skipping special file "./pg_hba.conf"
2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING: skipping special file "./pg_hba.conf.save" <---------------recorded in pg_log on master node and copied by pg_basebackup (note time difference between two servers)
2016-04-15 23:15:02 CEST:@:[23321] LOG: database system was interrupted; last known up at 2016-04-15 23:15:10 CEST
2016-04-15 23:15:02 CEST:postgres@postgres:[23329] FATAL: the database system is starting up
2016-04-15 23:15:03 CEST:@:[23321] LOG: entering standby mode
2016-04-15 23:15:03 CEST:@:[23321] LOG: database system was not properly shut down; automatic recovery in progress <---------something missing from pg_basebackup
2016-04-15 23:15:03 CEST:@:[23321] LOG: redo starts at 0/E3000028
2016-04-15 23:15:03 CEST:@:[23321] LOG: consistent recovery state reached at 0/E4000000
2016-04-15 23:15:03 CEST:@:[23295] LOG: database system is ready to accept read only connections
2016-04-15 23:15:03 CEST:@:[23356] LOG: started streaming WAL from primary at 0/E4000000 on timeline 1
-------second trial
# su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP} -D ${GEO_STDBY_DATA} -U pgreplic -P -v -X stream"
# echo $?
1
# cat /var/log/cluster/geo_repair.log.err
transaction log start point: 0/E5000028 on timeline 1
pg_basebackup: starting background WAL receiver
WARNING: skipping special file "./pg_hba.conf"
WARNING: skipping special file "./pg_hba.conf.save"
transaction log end point: 0/E50000F8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: could not wait for child process: No child processes <----what does this mean? I think it failed to start process to fetching wal logs created during backup: but neither on master node neither on pg_basebackup output here is any information about reason.. (max_wal_senders on master is 10: I see no reason to fail).
postgres logs:
`pg_xlog/0000000100000000000000E4' -> `../backups/arc/0000000100000000000000E4'
2016-04-15 23:35:09 CEST:pgreplic@[unknown]:[29035] WARNING: skipping special file "./pg_hba.conf"
2016-04-15 23:35:09 CEST:pgreplic@[unknown]:[29035] WARNING: skipping special file "./pg_hba.conf.save"
2016-04-15 23:35:01 CEST:@:[28926] LOG: database system was interrupted; last known up at 2016-04-15 23:35:09 CEST
2016-04-15 23:35:01 CEST:postgres@postgres:[28938] FATAL: the database system is starting up
2016-04-15 23:35:02 CEST:@:[28926] LOG: entering standby mode
2016-04-15 23:35:02 CEST:@:[28926] LOG: database system was not properly shut down; automatic recovery in progress <------------this means something missing from pg_basebackup
2016-04-15 23:35:02 CEST:@:[28926] LOG: redo starts at 0/E5000028
2016-04-15 23:35:02 CEST:@:[28926] LOG: consistent recovery state reached at 0/E6000000
2016-04-15 23:35:02 CEST:@:[28904] LOG: database system is ready to accept read only connections
2016-04-15 23:35:02 CEST:@:[28989] LOG: started streaming WAL from primary at 0/E6000000 on timeline 1
postgres params on master node:
log_line_prefix = '%t:%u@%d:[%p] '
logging_collector = on
wal_buffers = 16MB
max_wal_size = 200MB
log_temp_files = 1MB
max_connections = 170
shared_buffers = 512MB
effective_cache_size = 1500MB
work_mem = 48MB
log_lock_waits = on
log_min_duration_statement = 10000
shared_preload_libraries = 'pg_stat_statements'
include '/var/lib/pgsql/tmp/rep_mode.conf' # added by pgsql RA
wal_level = hot_standby
archive_mode = on
max_wal_senders = 10
hot_standby = on
wal_keep_segments = 128
archive_command = '/opt/postgres/dbconf/archive_command.sh %p %f'
wal_receiver_status_interval = 2
max_standby_streaming_delay = -1
max_standby_archive_delay = -1
restart_after_crash = off
hot_standby_feedback = on