Both Master and Slave are EC2 instances running Postgres version 9.0.04 on Ubuntu 10.04. As far as I can tell, DB replication had been stalled for around 3 months. On the Master, a new 16 MB WAL file is created every 2-5 minutes.
For replication, I am following this link:
http://wiki.postgresql.org/wiki/Streaming_Replication
I am also referring to:
http://www.postgresql.org/docs/9.0/static/continuous-archiving.html
http://www.depesz.com/index.php/2010/03/11/setting-wal-replication/
Before starting the backup, I ensured the following:
- On the Slave, I cleared the contents of 'pg_xlog/*'.
- Both Master and Slave have the following in postgresql.conf:
wal_level = archive
hot_standby = off
- In postgresql.conf, the Master has:
max_wal_senders = 5
wal_keep_segments = 10
- On the Slave, recovery.conf has the following 3 parameters:
standby_mode = 'on'
primary_conninfo = 'host=10.218.61.143 port=5432 user=postgres'
trigger_file = '/data/db/trigger_failover'
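For scale, here is a minimal sketch of how much WAL these settings guarantee on the Master, assuming the default 16 MB segment size and the rate of one segment every 2-5 minutes mentioned above. `retention_window` is a hypothetical helper written for this post, not a Postgres function, and the result is only the minimum that wal_keep_segments promises (the Master may keep more).

```python
# Rough estimate of the WAL history guaranteed by wal_keep_segments alone.
# Inputs come from this post: wal_keep_segments = 10 and one 16 MB segment
# every 2-5 minutes. Assumption: default 16 MB segment size.

WAL_SEGMENT_MB = 16


def retention_window(keep_segments, minutes_per_segment):
    """Return (megabytes, minutes) of WAL covered by keep_segments."""
    size_mb = keep_segments * WAL_SEGMENT_MB
    minutes = keep_segments * minutes_per_segment
    return size_mb, minutes


if __name__ == "__main__":
    for rate in (2, 5):
        size_mb, minutes = retention_window(10, rate)
        print(f"{rate} min/segment: about {size_mb} MB, {minutes} minutes of WAL")
```

With the values above this works out to about 160 MB, or roughly 20 to 50 minutes of WAL activity.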
I used the following commands for the backup. As soon as the backup finished, I started Postgres on the Slave.
psql -c "SELECT pg_start_backup('label', true)";
rsync -av --progress /data/db/main/ 10.40.89.9:/data/db/main/ --exclude 'pg_log/*' --exclude 'pg_xlog/*' --exclude postmaster.pid --exclude pg_hba.conf --exclude postgresql.conf;
psql -c "SELECT pg_stop_backup()";
On the Slave, I see the following processes running:
$ ps -ef | grep postgres
postgres 1895 1 0 Nov18 ? 00:00:00 /usr/lib/postgresql/9.0/bin/postgres -D /data/db/main -c config_file=/etc/postgresql/9.0/main/postgresql.conf
postgres 1896 1895 0 Nov18 ? 00:00:00 postgres: startup process waiting for 0000000100000B110000000D
On the Slave, the log shows that it is unable to get the requested WAL segment:
$ tail /var/log/postgresql/postgresql-9.0-main.log
2011-11-19 07:09:50 UTC LOG: streaming replication successfully connected to primary
2011-11-19 07:09:50 UTC FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 0000000100000B110000000D has already been removed
I confirmed that the requested WAL segment 0000000100000B110000000D doesn't exist on the Master.
On the Master, the process listing shows:
$ ps -ef | grep postgres
postgres 25395 25389 0 Nov14 ? 00:00:06 postgres: archiver process last was 0000000100000B1F00000081
The log on the Master also indicates that the requested WAL segment was removed:
$ tail postgresql-2011-11-18_221110.csv
2011-11-18 23:15:01.355 PST,"postgres","",20523,"10.40.89.9:46157",4ec75775.502b,1,"authentication",2011-11-18 23:15:01 PST,5/703238,0,LOG,00000,"replication connection authorized: user=postgres host=10.40.89.9 port=46157",,,,,,,,,""
2011-11-18 23:15:01.356 PST,"postgres","",20523,"10.40.89.9:46157",4ec75775.502b,2,"startup",2011-11-18 23:15:01 PST,5/0,0,FATAL,58P01,"requested WAL segment 0000000100000B110000000D has already been removed",,,,,,,,,""
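To compare the segment the Slave asks for with the one the Master's archiver last reported, here is a small sketch that decodes the two file names and counts the segments between them. It assumes the usual 24-hex-digit layout (timeline, log id, segment, 8 digits each) and that on 9.0 each log id holds 255 segments (00 through FE); `parse_wal_name` and `segment_gap` are hypothetical helpers, not part of Postgres.

```python
# Count WAL segments between two segment file names.
# Assumptions (not taken from the logs above):
#   - names are 24 hex digits: timeline, log id, segment (8 digits each)
#   - on 9.0 each log id contains 255 segments, numbered 00..FE

SEGMENTS_PER_LOG = 255


def parse_wal_name(name):
    """Split a WAL segment file name into (timeline, log_id, segment)."""
    if len(name) != 24:
        raise ValueError(f"unexpected WAL file name: {name!r}")
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)


def segment_gap(older, newer):
    """Number of segments from 'older' up to 'newer' on the same timeline."""
    tli_a, log_a, seg_a = parse_wal_name(older)
    tli_b, log_b, seg_b = parse_wal_name(newer)
    if tli_a != tli_b:
        raise ValueError("segments are on different timelines")
    return (log_b - log_a) * SEGMENTS_PER_LOG + (seg_b - seg_a)


if __name__ == "__main__":
    requested = "0000000100000B110000000D"   # segment the Slave waits for
    archived = "0000000100000B1F00000081"    # archiver "last was" on Master
    print(segment_gap(requested, archived), "segments apart")
```

If the 255-per-log assumption holds, the two names are 3686 segments apart, far more than the 10 segments kept by wal_keep_segments.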
On the Slave, I even tried deleting everything under /data/db/main and taking the backup again, but the issue still persists.
This does not seem to be a case of a slow Slave failing to catch up with the Master, because:
1) This happens as soon as the Slave DB is started, so the Slave doesn't even get the first WAL file.
2) Both machines are in the same EC2 zone and the backup runs at a fairly good speed, so network connectivity issues are also ruled out.
I searched various forums where people encountered a similar error; however, in all of those cases the WAL file still existed on the Master. In this case the Master is not retaining the WAL file required by the Slave.
I am unable to understand why the Master is not retaining the WAL files. Any pointers/suggestions would be helpful.