
Streaming Replication: Observations, Questions and Comments

Hi all,

We have a PostgreSQL 9.0 streaming replication setup where we keep WAL segments on the master amounting to 10 GB, so that we can survive longer periods of disconnect between master and slave. We do not use any shared storage for archiving WAL logs (a shared disk server could turn out to be another point of failure, which we want to avoid).

Here are our basic configuration parameters:

postgresql.conf on master:
wal_keep_segments = 640         # previously 32; in logfile segments, min 1, 16MB each
archive_mode = on               # allows archiving to be done (change requires restart)
archive_command = 'cp -v %p /archives/data/pgsql/pg_xlog/%f' 
 

postgresql.conf on slave:
wal_level = hot_standby
hot_standby = on

recovery.conf on slave:
standby_mode = 'on'
primary_conninfo = 'host=ip.add.ress port=5432 user=repman'
trigger_file = '/var/lib/pgsql/data/stop.replication'
restore_command = 'cp -i /archives/data/pgsql/pg_xlog/%f "%p"'
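
(For completeness, promoting the slave to standalone operation is just a matter of creating the trigger file named above; a minimal sketch, using the path from recovery.conf:)

touch /var/lib/pgsql/data/stop.replication    # slave leaves recovery and starts accepting writes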

Master and slave servers are separated by thousands of miles, and the network link is just an ordinary 1 Mbps DSL line. Both the master and the slave have the /archives partition mounted, kept in sync between the two systems with csync2. I'm not sure if this is the correct way of configuring streaming replication, but I will explain what worked for us and what we are still left wanting:

Under heavy inserts/updates/deletes on the master (load generated by stored procedures), we noticed that the slave fell far behind the master, which resulted in replication breaking. Hence we changed from 32 log file segments to 640, which corresponds to 10 GB, so that we can survive either very heavy spikes of load or even a week's disconnect of the slave (with alarms being raised appropriately in the meantime).
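
(For reference: 640 segments x 16 MB per segment = 10,240 MB, i.e. roughly 10 GB of retained WAL. And on a 1 Mbps link a single 16 MB segment takes on the order of two minutes to ship, so a heavy write burst can easily outrun the standby.)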

One strange thing I noticed is that pg_xlog on the master has outgrown the actual data stored in the database by at least 3-4 times, which was quite surprising. I'm not sure if 'restore_command' has anything to do with it. I did not understand why the transaction logs would need to be so many times larger than the actual size of the database; have I done something wrong somewhere?
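
(The on-disk sizes can be compared on the master roughly like this; this is only a sketch, assuming the same /var/lib/pgsql/data layout that the trigger file path above uses:)

du -sh /var/lib/pgsql/data/base       # actual table and index data
du -sh /var/lib/pgsql/data/pg_xlog    # WAL segments retained because of wal_keep_segments
du -sh /archives/data/pgsql/pg_xlog   # archived copies written by archive_command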

Another interesting fact we noticed is that once replication had been broken for a longer time (the walsender and walreceiver processes have died by this point), we had to restart not only the slave server but also the master server, which was quite strange. Shouldn't the master start a walsender process (if none is running) the moment it receives a request for streaming? Similarly, why should the slave have to be restarted just to start replication again? Why can't these two processes be started and stopped independently by the postmaster as and when necessary, as per the need to replicate or not?
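
(While streaming is still alive, the lag can at least be watched with the 9.0 WAL location functions; a rough sketch, run via psql on each server:)

psql -c "SELECT pg_current_xlog_location();"                                        # on the master: WAL generated so far
psql -c "SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();"   # on the slave: WAL received / replayed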

Another thing I noticed was that the slave server logged that it was out of sync and was therefore closing replication, but the master did not say anything about this breakage of replication.
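
(A crude way to see whether the master still believes it has a streaming client is to look for the wal sender backend; just a sketch:)

ps -ef | grep '[w]al sender'     # on the master; no output means no walsender is running
ps -ef | grep '[w]al receiver'   # on the slave, the corresponding walreceiver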

So, summing up the above, I would like some pointers to understand these observations, which I think will benefit many others as well. Some of them might be due to wrong configuration, while others may give hints for future enhancements. I hope this will start a healthy discussion on the areas where streaming replication needs to be improved and strengthened.

Thanks and Regards,
Samba
