Postgres 9.0.4 replication issue: FATAL: requested WAL segment 0000000100000B110000000D has already been removed

Ashish Gupta <ashish.gupta.cal@xxxxxxxxx> · Sat, 19 Nov 2011 15:14:37 +0530

Hi,

Streaming replication is not taking place: the WAL segment that the slave is requesting no longer exists on the master.
Both master and slave are EC2 instances running Postgres 9.0.4 on Ubuntu 10.04. As far as I can tell, replication had been stalled for around 3 months. On the master, a new 16 MB WAL segment is created every 2-5 minutes.
To set up replication, I am following:
http://wiki.postgresql.org/wiki/Streaming_Replication
I am also referring to:
http://www.postgresql.org/docs/9.0/static/continuous-archiving.html

http://www.depesz.com/index.php/2010/03/11/setting-wal-replication/
Before starting the backup, I ensured the following:
- On the slave, I cleared the contents of 'pg_xlog/*'.

- Both master and slave have the following in postgresql.conf:
wal_level = archive
hot_standby = off
- On the master, postgresql.conf also has:
max_wal_senders = 5
wal_keep_segments = 10
- On the slave, recovery.conf has the following 3 parameters:

standby_mode = 'on'
primary_conninfo = 'host=10.218.61.143 port=5432 user=postgres'
trigger_file = '/data/db/trigger_failover'
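
For a sense of scale, the settings above can be turned into a rough retention estimate. This is only a sketch: it assumes the default 16 MB segment size and the 2-5 minute segment rate mentioned earlier, and wal_keep_segments is a guaranteed minimum, so the master may hold somewhat more than this.

```shell
#!/bin/sh
# Values taken from the post; the per-segment interval is the observed 2-5 minute range.
wal_keep_segments=10
segment_mb=16       # assumed default WAL segment size
fastest_min=2       # minutes per segment when the master is busiest
slowest_min=5       # minutes per segment when the master is quietest

# Minimum amount of WAL the master is asked to keep in pg_xlog.
retained_mb=$((wal_keep_segments * segment_mb))

# How long that much WAL lasts at the observed generation rate.
window_low=$((wal_keep_segments * fastest_min))
window_high=$((wal_keep_segments * slowest_min))

echo "retained WAL: ${retained_mb} MB"
echo "retention window: ${window_low}-${window_high} minutes"
```

With these numbers the guaranteed minimum is 160 MB of WAL, covering roughly 20-50 minutes of activity on the master.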
I used the following commands for the backup. As soon as the backup finished, I started postgres on the slave.

psql -c "SELECT pg_start_backup('label', true)";
rsync -av --progress /data/db/main/ 10.40.89.9:/data/db/main/ --exclude 'pg_log/*' --exclude 'pg_xlog/*' --exclude postmaster.pid --exclude pg_hba.conf --exclude postgresql.conf;

psql -c "SELECT pg_stop_backup()";
On the slave, I see the following processes running:
$ ps -ef | grep postgres
postgres  1895     1  0 Nov18 ?        00:00:00 /usr/lib/postgresql/9.0/bin/postgres -D /data/db/main -c config_file=/etc/postgresql/9.0/main/postgresql.conf

postgres  1896  1895  0 Nov18 ?        00:00:00 postgres: startup process   waiting for 0000000100000B110000000D
On the slave, the log shows that the requested WAL segment cannot be found:
$ tail /var/log/postgresql/postgresql-9.0-main.log

2011-11-19 07:09:50 UTC LOG:  streaming replication successfully connected to primary
2011-11-19 07:09:50 UTC FATAL:  could not receive data from WAL stream: FATAL:  requested WAL segment 0000000100000B110000000D has already been removed
I confirmed that the requested WAL segment 0000000100000B110000000D doesn't exist on the master.
On the master, the process listing shows:
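
A check of this kind can be sketched as follows. The pg_xlog path is an assumption derived from the data directory used in the rsync command, and has_segment is a hypothetical helper name, not a Postgres tool.

```shell
#!/bin/sh
# has_segment DIR NAME: succeeds if a WAL segment file NAME exists in DIR.
has_segment() {
    [ -f "$1/$2" ]
}

# Assumed location: pg_xlog under the master's data directory /data/db/main.
xlog_dir=/data/db/main/pg_xlog
segment=0000000100000B110000000D

if has_segment "$xlog_dir" "$segment"; then
    echo "$segment is present in $xlog_dir"
else
    echo "$segment is missing from $xlog_dir"
fi
```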
$ ps -ef | grep postgres
postgres 25395 25389  0 Nov14 ?        00:00:06 postgres: archiver process   last was 0000000100000B1F00000081
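
To see how far apart the two segment names are, the distance between the segment the slave requests and the last one the archiver reports can be computed from the file names alone. This is a sketch under two assumptions: the names follow the usual layout of 8 hex digits each for timeline, log id and segment, and each log id holds 255 segments of 16 MB, as on servers older than 9.3. seg_index is a hypothetical helper name.

```shell
#!/bin/sh
# seg_index NAME: absolute position of a 24-hex-digit WAL segment file name.
# Assumes 255 segments per log id (16 MB segments, pre-9.3 numbering).
seg_index() {
    log=$(printf '%s' "$1" | cut -c9-16)    # hex digits 9-16: log id
    seg=$(printf '%s' "$1" | cut -c17-24)   # hex digits 17-24: segment within the log id
    echo $(( 0x$log * 255 + 0x$seg ))
}

requested=0000000100000B110000000D   # segment the slave is waiting for
archived=0000000100000B1F00000081    # last segment reported by the master's archiver

gap=$(( $(seg_index "$archived") - $(seg_index "$requested") ))
gap_mb=$(( gap * 16 ))

echo "gap: $gap segments (about $gap_mb MB of WAL)"
```

For the two names in this post, that works out to 3686 segments, or about 58976 MB of WAL, between the requested segment and the archiver's last one.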
The log on the master also indicates that the requested WAL segment was removed:
$ tail postgresql-2011-11-18_221110.csv
2011-11-18 23:15:01.355 PST,"postgres","",20523,"10.40.89.9:46157",4ec75775.502b,1,"authentication",2011-11-18 23:15:01 PST,5/703238,0,LOG,00000,"replication connection authorized: user=postgres host=10.40.89.9 port=46157",,,,,,,,,""

2011-11-18 23:15:01.356 PST,"postgres","",20523,"10.40.89.9:46157",4ec75775.502b,2,"startup",2011-11-18 23:15:01 PST,5/0,0,FATAL,58P01,"requested WAL segment 0000000100000B110000000D has already been removed",,,,,,,,,""
On the slave, I even tried deleting everything under /data/db/main and taking the backup again, but the issue persists.
This does not look like a case of a slow slave failing to keep up with the master, because:
1) The error occurs as soon as the slave DB is started, so the slave never receives even the first WAL file.

2) Both machines are in the same EC2 zone and the backup runs at a fairly good speed, so network connectivity problems are ruled out as well.
I searched various forums where people hit a similar error, but in all of those cases the WAL file still existed on the master. In this case, the master is not retaining the WAL file required by the slave.
I am unable to understand why the master is not retaining the WAL files. Any pointers or suggestions would be helpful.
Thanks for your attention.

Ashish