Re: Replication Lag Causes

Mike Wilson <mfwilson@xxxxxxxxx> · Sun, 2 Nov 2014 13:16:24 -0800

Thanks for the information Greg.
Unfortunately modifying the application stack this close to the holiday season won’t be an option so I’m left with:
   1) Trying to optimize the settings I have for the query mix I have.
   2) Optimize any long running DML queries (if any) to prevent lag due to locks.
   3) Getting a better understanding of “what” causes lag.

#3 will probably be central to at least minimizing lag during heavy DML load.  If anyone has a good resource to describe when a slave would start to lag potentially that would help me hunt for the cause.  I know long running DML on the master may cause lag but I’m uncertain as to the specifics of why.  During periods of lag we do have more DML than usual running against the master but the queries themselves are very quick although there might be 20-30 DML operations per second against some of our central tables that store user account information.  Even under heavy DML the queries still return in under a second.  Possibly a large volume of of short running DML cause replication lag issues for large tables (~20M)?

Thanks again for your help.  BDR looks interesting but probably too cutting edge for my client.

Mike Wilson

On Nov 2, 2014, at 12:33 PM, Greg Spiegelberg <gspiegelberg@xxxxxxxxx> wrote:

Hi Mike,
Sounds very familiar.  Our master fans out to 16 slaves (cascading) and we had great success with segregating database queries to different slaves and some based on network latency.  I'd suggest, if possible, alter the application to use the slave for simple SELECT's and FUNCTION's performing SELECT-like only work while limiting those applications and queries that perform DML to the master (obviously).  If the load on the slave increases too much, spin up another slave.  I'd mention from experience that it could be the load on the slave that is giving the appearance of replication lag.  This is what led us to having (1) slave per application.

There is also the BDR multi-master available in 9.4beta if you're wanting to live on the edge.

-Greg

On Sat, Nov 1, 2014 at 4:33 PM, Mike Wilson <mfwilson@xxxxxxxxx> wrote:
I have two 9.3.4 PG instances that back a large internet website that has very seasonal traffic and can generate large query loads.  My instances are in a master-slave streaming replication setup  and are stable and in general perform very well.  The only issues we have with the boxes is that when the master is busy the slave may start to lag excessively.  I can give specifics as to what heavily loaded means and additionally the postgresql.conf for both boxes but my basic questions are:   * What causes streaming replication lag to increase?  
   * What parameters can be tuned to reduce streaming replication lag?
   * Can a loaded slave affect lag adversely?
   * Can increasing max_wal_senders help reduce lag?

The reason I ask this is that as mentioned above the servers are stable and are real troopers in general as they back a very popular web site that puts the master under heavy seasonal load at times.  At those times though we see an almost exponential growth in streaming replication lag compared to load on the master.  

For example, the master is a very beefy Solaris:
   * 4 Recent Intel Zeons (16 physical cores)
   * 256 GB of ECC RAM
   * 12 TB of ZFS (spindle and SSD internal storage)
   * DB on disk size is 2TB
   * ZFS ARC cache of roughly 250G.
   * ZFS ARC2/ZIL configured for SSD’s (this is awesome by the way)

Basic PG Config:
   shared_buffers = 2GB
   work_mem = 128MB
   max_connections = 1700 (supports roughly 100 web servers)
   wal_keep_segments = 256 (roughly enough for 24 hours of operation under heavy load)
   wal_sender_timeout = 60s
   replication_timeout=(not set)
   wal_receiver_status_interval=10s
   max_wal_senders=6
   * wal archiving is off
   * 98% of the queries on the master complete in under 500ms.
   * No hung or very long running queries in general.

The master on a normal day maintains a load of about 0.5, during which replication lag to the slave is in hundreds milliseconds.  When the production db server is heavily hit though the load may go as high as 4 on the master and the streaming replication lag may increase to more than 2 hours relatively quickly.  Load on the slave is generally below 1 even when the master is heavily loaded.  The traffic to the master is primarily read with about 10% DML (new users, purchase records, etc).  DML statements increase proportionally when under load though.  The master and slave are connected via dedicated 10G fiber link and even under heavy load the utilization of the link is nowhere near close to saturation.  BTW, the slave does run some reported related queries throughout the day that might take up to a minute to complete.

I have the task of figuring out why this otherwise healthy DB starts to lag so badly under load and if there is anything that we could do about it.  I’ve been wondering particularly if we should up the max_wal_senders but from the docs it is unclear if that would help.  In my testing with pg_bench on our dev boxes which were the previous production hardware for these servers I have determined that it doesn’t take much DML load on the master to get the slave to start lagging severely.  I was wondering if this was expected and/or some design consideration?  Possibly streaming replication isn’t meant to be used for heavily hit databases and maintain small lag times?  I would like to believe that the fault is something we have done though and that there is some parameter we could tune to reduce this lag.

Any recommendations would be very helpful.  

Mike Wilson
Predicate Logic Consulting