Re: Replication Lag Causes

Mike Wilson <mfwilson@xxxxxxxxx> · Sun, 2 Nov 2014 12:58:31 -0800

Load on the slave is relatively light.  It averages about 1.0 due to some data ware house select queries running against it frequently.  Previously only the load on the master seems to have affected our replication lag no matter what the slave was doing.  
In thinking about this a bit more, the load on the master does cause increasing lag but only if the query mix begins to change to more DML than SELECTS.  Basically, the amount of DML is what really appears to cause the replication to lag.  This is an OLTP system backing a rather heavy commercial website where memberships are sold and when the purchase traffic increases that is when we start to see extreme lag develop on the slave.

CPU utilization on the slave during extreme lag is similar to normal operation even if the slave is lagging more than usual.

Thanks for the info on max_wal_senders.  That’s good to know.

Mike Wilson

On Nov 1, 2014, at 4:14 PM, Valentine Gogichashvili <valgog@xxxxxxxxx> wrote:

Hello Mike, 
what kind of load does the slave get?

what does the recovery process do on the slave during the times when lag is being observed? Does it use 100% of the CPU?

WAL can be replayed by only one process, so no need to increase the max_wal_senders.

Cheers,

-- Valentine Gogichashvili

On Sun, Nov 2, 2014 at 1:33 AM, Mike Wilson <mfwilson@xxxxxxxxx> wrote:
I have two 9.3.4 PG instances that back a large internet website that has very seasonal traffic and can generate large query loads.  My instances are in a master-slave streaming replication setup  and are stable and in general perform very well.  The only issues we have with the boxes is that when the master is busy the slave may start to lag excessively.  I can give specifics as to what heavily loaded means and additionally the postgresql.conf for both boxes but my basic questions are:   * What causes streaming replication lag to increase?  
   * What parameters can be tuned to reduce streaming replication lag?
   * Can a loaded slave affect lag adversely?
   * Can increasing max_wal_senders help reduce lag?

The reason I ask this is that as mentioned above the servers are stable and are real troopers in general as they back a very popular web site that puts the master under heavy seasonal load at times.  At those times though we see an almost exponential growth in streaming replication lag compared to load on the master.  

For example, the master is a very beefy Solaris:
   * 4 Recent Intel Zeons (16 physical cores)
   * 256 GB of ECC RAM
   * 12 TB of ZFS (spindle and SSD internal storage)
   * DB on disk size is 2TB
   * ZFS ARC cache of roughly 250G.
   * ZFS ARC2/ZIL configured for SSD’s (this is awesome by the way)

Basic PG Config:
   shared_buffers = 2GB
   work_mem = 128MB
   max_connections = 1700 (supports roughly 100 web servers)
   wal_keep_segments = 256 (roughly enough for 24 hours of operation under heavy load)
   wal_sender_timeout = 60s
   replication_timeout=(not set)
   wal_receiver_status_interval=10s
   max_wal_senders=6
   * wal archiving is off
   * 98% of the queries on the master complete in under 500ms.
   * No hung or very long running queries in general.

The master on a normal day maintains a load of about 0.5, during which replication lag to the slave is in hundreds milliseconds.  When the production db server is heavily hit though the load may go as high as 4 on the master and the streaming replication lag may increase to more than 2 hours relatively quickly.  Load on the slave is generally below 1 even when the master is heavily loaded.  The traffic to the master is primarily read with about 10% DML (new users, purchase records, etc).  DML statements increase proportionally when under load though.  The master and slave are connected via dedicated 10G fiber link and even under heavy load the utilization of the link is nowhere near close to saturation.  BTW, the slave does run some reported related queries throughout the day that might take up to a minute to complete.

I have the task of figuring out why this otherwise healthy DB starts to lag so badly under load and if there is anything that we could do about it.  I’ve been wondering particularly if we should up the max_wal_senders but from the docs it is unclear if that would help.  In my testing with pg_bench on our dev boxes which were the previous production hardware for these servers I have determined that it doesn’t take much DML load on the master to get the slave to start lagging severely.  I was wondering if this was expected and/or some design consideration?  Possibly streaming replication isn’t meant to be used for heavily hit databases and maintain small lag times?  I would like to believe that the fault is something we have done though and that there is some parameter we could tune to reduce this lag.

Any recommendations would be very helpful.  

Mike Wilson
Predicate Logic Consulting