9.0 Streaming Replication Problem to two slaves

Michael Best <mbest@xxxxxxxxxxxxx> · Mon, 25 Jul 2011 11:38:12 -0600

I have a master server and two slave servers, one in the same rack and 
one in another data center that has a normal latency of about 15ms.

Both master and slaves are running CentOS 5.6 x86_64 with:
postgresql90-server-9.0.4-1PGDG.rhel5.x86_64 from
http://yum.pgrpms.org

The master server is using:
wal_level = hot_standby
checkpoint_segments = 64
max_wal_senders = 10
wal_keep_segments = 512

(good for 8-12 hours of wal segments, way more than required, but I have 
been trying to debug this)
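
As a rough check on that setting, the disk space pinned by
wal_keep_segments and the implied WAL rate can be worked out directly.
This is a minimal sketch that assumes the default 16 MB segment size; the
function names are made up for illustration.

```shell
# Minimum bytes of WAL the master keeps in pg_xlog for wal_keep_segments.
# Assumes the default 16 MB segment size unless a second argument is given.
wal_keep_bytes() {
    segments=$1
    segment_mb=${2:-16}
    echo $(( segments * segment_mb * 1024 * 1024 ))
}

# Whole segments generated per hour if $1 segments cover $2 hours.
segments_per_hour() {
    echo $(( $1 / $2 ))
}
```

With the values above, `wal_keep_bytes 512` gives 8589934592 bytes (8 GB),
and 512 segments over 8 to 12 hours works out to roughly 42 to 64
segments per hour.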

The slaves are using:
hot_standby = on
max_standby_streaming_delay = 60s

I have the servers configured and can get replication up and running. It 
runs for the better part of a day, and then the slaves appear to stop 
receiving or requesting updates. There doesn't appear to be anything in 
the logs other than:

Jul 23 09:32:47 backupdb postgres[23010]: [2-1] FATAL:  terminating 
connection due to conflict with recovery
Jul 23 09:32:47 backupdb postgres[23010]: [2-2] DETAIL:  User query 
might have needed to see row versions that must be removed.
Jul 23 09:32:47 backupdb postgres[23010]: [2-3] HINT:  In a moment you 
should be able to reconnect to the database and repeat your command.

I don't have any idea what might be causing the problem. One theory is 
that disk access on the slaves is not fast enough: when copying other 
backup files to those servers competes for the disks, the resulting 
slowdown may be causing recovery to falter.

I am monitoring the master and slaves for synchronization using a script 
which selects some data from each server and compares the result.
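
A data comparison shows that a slave has fallen behind, but not where.
One complementary check, sketched here, compares WAL positions using
functions that exist in 9.0: pg_current_xlog_location() on the master,
and pg_last_xlog_receive_location() and pg_last_xlog_replay_location()
on a slave. The helper names and host arguments are hypothetical, and the
byte conversion assumes the pre-9.3 layout in which each log id spans
0xFF000000 bytes.

```shell
# Convert an xlog location such as "2/A1000020" to a byte count.
# Assumes the pre-9.3 layout: each log id covers 0xFF000000 bytes.
xlog_to_bytes() {
    logid=${1%%/*}
    offset=${1##*/}
    echo $(( 0x$logid * 0xFF000000 + 0x$offset ))
}

# Bytes by which the second location trails the first.
xlog_lag_bytes() {
    echo $(( $(xlog_to_bytes "$1") - $(xlog_to_bytes "$2") ))
}

# Print receive and replay lag for one slave (host names are placeholders).
report_lag() {
    master=$1
    slave=$2
    sent=$(psql -At -h "$master" -d postgres -c "SELECT pg_current_xlog_location()")
    recv=$(psql -At -h "$slave" -d postgres -c "SELECT pg_last_xlog_receive_location()")
    replay=$(psql -At -h "$slave" -d postgres -c "SELECT pg_last_xlog_replay_location()")
    echo "$slave receive_lag=$(xlog_lag_bytes "$sent" "$recv") replay_lag=$(xlog_lag_bytes "$sent" "$replay")"
}
```

Run periodically as, say, `report_lag master.example.com server1.example.com`,
this should help separate two cases: a growing receive_lag suggests the
slave has stopped receiving WAL, while a growing replay_lag alone suggests
WAL is arriving but replay is stalled.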

To construct the slaves I am using the following script:

SERVERS="server1.example.com server2.example.com"

if [ "$(whoami)" = 'postgres' ]
then
  psql -d postgres -c "checkpoint; select pg_switch_xlog();"
  psql -c "SELECT pg_start_backup('backup', true)";
else
  su - postgres -c "psql -c \"checkpoint; select pg_switch_xlog();\";"
  su - postgres -c "psql -c \"SELECT pg_start_backup('backup', true)\";"
fi

for server in $SERVERS
do
    ssh root@$server /etc/init.d/postgresql-9.0 stop
    rsync -zav --delete \
        --exclude postmaster.pid --exclude recovery.conf \
        --exclude postgresql.conf --exclude pg_hba.conf \
        /var/lib/pgsql/9.0/data/ \
        "root@$server:/var/lib/pgsql/9.0/data/"
done

if [ "$(whoami)" = 'postgres' ]
then
  # the statement_timeout kills the command after 60 seconds
  # this is a hack, but otherwise it hangs indefinitely
  psql -c "SET statement_timeout = 60000; SELECT pg_stop_backup()"
else
  su - postgres -c "psql -c \"SELECT pg_stop_backup()\""
fi

for server in $SERVERS
do
    rsync -zav --delete /var/lib/pgsql/9.0/data/pg_xlog/ \
        "root@$server:/var/lib/pgsql/9.0/data/pg_xlog/"

    ssh root@$server /etc/init.d/postgresql-9.0 start
done
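
The script excludes recovery.conf from the rsync, so each slave must
already have one in place. For reference, a minimal recovery.conf for a
9.0 streaming standby might be generated as follows. The function name
and the host, port, and user values are placeholders, not taken from the
setup above.

```shell
# Print a minimal 9.0 recovery.conf for a streaming standby.
# Arguments (all placeholders): master host, port, replication user.
print_recovery_conf() {
    cat <<EOF
standby_mode = 'on'
primary_conninfo = 'host=$1 port=${2:-5432} user=${3:-postgres}'
EOF
}
```

For example, `print_recovery_conf master.example.com` writes the two
settings to standard output, which could then be redirected into
recovery.conf in the slave's data directory.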

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)