base backup/restore + streaming replication => weirdness

domehead100 <domehead100@xxxxxxxxx> · Fri, 22 Feb 2013 14:11:03 -0800 (PST)

I have a smallish Postgres 9.0 database with Primary and Standby instances.

These instances are set up with streaming replication from the Primary to
the Standby.  The primary archives WAL files to a shared directory that is
accessible from the Standby.  The Standby runs as a hot standby and receives
WAL records from the Primary over TCP.

We had an issue this week where the shared directory where WAL files were
being archived (/pgsql_wal) ran out of space.

To restart replication, I performed a base backup on Primary (tar $PGDATA to
/pgsql_wal) and then performed a base restore (untar) on Standby.

After this, the Standby is staying in recovery mode (recovery.conf never
gets changed to recovery.done), and my check_replication.sh script shows
strange results.  The sequence number for the Primary (first item below) is
totally different from either the received or applied sequence numbers on
the Standby. 

Primary:
 pg_current_xlog_location
--------------------------
 1E/D5C40A40           <= this looks strange
(1 row)

Standby, last received:
 pg_last_xlog_receive_location
-------------------------------
 E/BF68BD08
(1 row)

Standby, last applied:
 pg_last_xlog_replay_location
------------------------------
 E/BF68BD08
(1 row)
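
To compare the three locations numerically rather than by eye, a minimal sketch follows. It is not part of check_replication.sh; the function names lsn_to_bytes and lsn_lag are hypothetical, and the arithmetic assumes the pre-9.3 WAL layout, in which each log id (the hex part before the slash) covers 0xFF000000 bytes (255 segments of 16 MB).

```shell
#!/bin/sh
# Convert an xlog location such as "1E/D5C40A40" to a byte position.
# Assumption: pre-9.3 layout, each log id spans 0xFF000000 bytes.
lsn_to_bytes() {
    logid=${1%%/*}     # hex digits before the slash
    offset=${1##*/}    # hex digits after the slash
    echo $(( 0x$logid * 0xFF000000 + 0x$offset ))
}

# Byte gap between two locations (first minus second).
lsn_lag() {
    echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# Primary's current location versus the Standby's replay location from above.
lsn_lag 1E/D5C40A40 E/BF68BD08
```

A result near zero would mean the Standby has replayed almost everything the Primary has written; a large positive value would mean the two locations really are far apart.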

I can connect to the Standby, and a select query seems to indicate that the
databases are in sync (they return the same value for max(<primary_key>) on
a table that is constantly receiving inserts).

One concern is that my tar command apparently did not exclude the files in
$PGDATA/pg_xlog, so those got untarred on the Standby.  Could that be a
problem?

Here's my basebackup.sh:
#! /bin/sh
# Base Backup script for streaming replication

BACKUP_FILE=/pgsql_wal/backup/pg_base_backup.tgz

psql -c "SELECT pg_start_backup('$BACKUP_FILE', true)" postgres

rm -rf $BACKUP_FILE

nice -n 10 tar czvpf $BACKUP_FILE --exclude={"$PGDATA/pg_xlog/*"} $PGDATA

psql -c "SELECT pg_stop_backup()" postgres
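
A possible reason the exclude had no effect: under /bin/sh, and also under bash when there is only one item between the braces, {"..."} is not expanded, so tar likely receives a pattern with literal braces that matches nothing. The sketch below shows the plain quoted form on a throwaway directory. DEMO and the file names inside it are hypothetical stand-ins for $PGDATA, and the paths are relative only to keep the demo self-contained.

```shell
#!/bin/sh
# Build a tiny fake data directory in a temp location (hypothetical layout).
DEMO=$(mktemp -d)
mkdir -p "$DEMO/data/pg_xlog" "$DEMO/data/base"
touch "$DEMO/data/pg_xlog/000000010000000E000000BF" "$DEMO/data/base/1"

# Plain quoted pattern, no braces: skips the contents of pg_xlog.
( cd "$DEMO" && tar czpf backup.tgz --exclude='data/pg_xlog/*' data )

# List the archive to confirm what was actually stored.
tar tzf "$DEMO/backup.tgz"
```

In basebackup.sh the equivalent change would be --exclude="$PGDATA/pg_xlog/*", with the rest of the tar command left as it is.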

And here's my baserestore.sh:
#! /bin/sh
# Base Recovery script for streaming replication (run on Standby)
# Run as postgres user
# Postgres should be stopped

DATE=`date +%Y_%m_%d`
CONF_BACKUP_DIR=/tmp/pgsql_conf_backup_$DATE
BASE_BACKUP_FILE=/pgsql_wal/backup/pg_base_backup.tgz

#backup config files
mkdir $CONF_BACKUP_DIR
cp $PGDATA/*.conf $CONF_BACKUP_DIR
cp $PGDATA/recovery.done $CONF_BACKUP_DIR

#blow away existing data directory
rm -rf $PGDATA

#untar base backup file
cd /
tar xzvf $BASE_BACKUP_FILE

#copy configs back
cp $CONF_BACKUP_DIR/*.conf $PGDATA
cp $CONF_BACKUP_DIR/recovery.done $PGDATA/recovery.conf
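
Because the restore script removes $PGDATA before it untars anything, a pre-flight check may be worth adding. This is only a sketch and not part of the script above; the function name preflight is hypothetical.

```shell
#!/bin/sh
# Hypothetical pre-flight check to run before "rm -rf $PGDATA".
# Returns non-zero and prints a reason if the restore should not proceed.
preflight() {
    pgdata=$1
    backup=$2
    if [ -z "$pgdata" ] || [ "$pgdata" = "/" ]; then
        echo "PGDATA is unset or points at /" >&2
        return 1
    fi
    if [ ! -r "$backup" ]; then
        echo "base backup $backup is missing or unreadable" >&2
        return 1
    fi
    return 0
}
```

It could be called near the top of the restore script as: preflight "$PGDATA" "$BASE_BACKUP_FILE" || exit 1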

--
View this message in context: http://postgresql.1045698.n5.nabble.com/base-backup-restore-streaming-replication-weirdness-tp5746342.html
Sent from the PostgreSQL - admin mailing list archive at Nabble.com.
