
Streaming Replication: Observations, Questions and Comments

Hi all,

We have a PostgreSQL 9.0 streaming replication setup where we keep WAL segments on the master amounting to 10 GB, so that we can survive longer periods of disconnect between master and slave. We do not use any shared storage for archiving WAL logs (a shared disk server could turn out to be another point of failure, which we want to avoid).

Here are our basic configuration parameters:

postgresql.conf on master:
wal_keep_segments = 640         # previously 32; in logfile segments, min 1, 16MB each
archive_mode = on               # allows archiving to be done (change requires restart)
archive_command = 'cp -v %p /archives/data/pgsql/pg_xlog/%f' 
 

postgresql.conf on slave:
wal_level = hot_standby
hot_standby = on

recovery.conf on slave:
standby_mode = 'on'
primary_conninfo = 'host=ip.add.ress port=5432 user=repman'
trigger_file = '/var/lib/pgsql/data/stop.replication'
restore_command = 'cp -i /archives/data/pgsql/pg_xlog/%f "%p"'
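
(For completeness, promoting the slave to standalone operation is just a matter of creating the trigger file named above; a minimal sketch, using the path from recovery.conf:)

touch /var/lib/pgsql/data/stop.replication    # slave leaves recovery and starts accepting writes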

Master and slave servers are separated by thousands of miles, and the network link is just an ordinary 1 Mbps DSL line. Both the master and the slave have the /archives partition mounted, kept in sync between the two systems with csync2. I'm not sure if this is the correct way of configuring streaming replication, but I will explain what worked for us and what we are still left wanting:

Under heavy inserts/updates/deletes on the master (load generated by stored procedures), we noticed that the slave fell far behind the master, which resulted in replication breaking. Hence we changed from 32 log file segments to 640, which corresponds to 10 GB, so that we can survive either very heavy spikes of load or even a week's disconnect of the slave (with alarms being raised appropriately in the meantime).
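
(For reference: 640 segments x 16 MB per segment = 10,240 MB, i.e. roughly 10 GB of retained WAL. And on a 1 Mbps link a single 16 MB segment takes on the order of two minutes to ship, so a heavy write burst can easily outrun the standby.)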

One strange thing I noticed is that pg_xlog on the master has outgrown the actual data stored in the database by at least 3-4 times, which was quite surprising. I'm not sure if 'restore_command' has anything to do with it. I did not understand why the transaction logs would need to be so many times larger than the actual size of the database; have I done something wrong somewhere?
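
(The on-disk sizes can be compared on the master roughly like this; this is only a sketch, assuming the same /var/lib/pgsql/data layout that the trigger file path above uses:)

du -sh /var/lib/pgsql/data/base       # actual table and index data
du -sh /var/lib/pgsql/data/pg_xlog    # WAL segments retained because of wal_keep_segments
du -sh /archives/data/pgsql/pg_xlog   # archived copies written by archive_command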

Another interesting fact we noticed is that once replication had been broken for a longer time (the walsender and walreceiver processes have died by this point), we had to restart not only the slave server but also the master server, which was quite strange. Shouldn't the master start a walsender process (if none is running) the moment it receives a request for streaming? Similarly, why should the slave have to be restarted just to start replication again? Why can't these two processes be started and stopped independently by the postmaster as and when necessary, as per the need to replicate or not?
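
(While streaming is still alive, the lag can at least be watched with the 9.0 WAL location functions; a rough sketch, run via psql on each server:)

psql -c "SELECT pg_current_xlog_location();"                                        # on the master: WAL generated so far
psql -c "SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();"   # on the slave: WAL received / replayed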

Another thing I noticed was that the slave server logged that it was out of sync and was therefore closing replication, but the master did not say anything about this breakage of replication.
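
(A crude way to see whether the master still believes it has a streaming client is to look for the wal sender backend; just a sketch:)

ps -ef | grep '[w]al sender'     # on the master; no output means no walsender is running
ps -ef | grep '[w]al receiver'   # on the slave, the corresponding walreceiver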

So, summing up the above, I would like some pointers to understand these observations, which I think will benefit many others as well. Some of them might be due to wrong configuration, while others may give hints for future enhancements. I hope this will start a healthy discussion on the areas where streaming replication needs to be improved and strengthened.

Thanks and Regards,
Samba
