We have a few busy 9.5 dbs, each streaming to a few slaves. The
master and slaves are identical hardware and are getting no small amount
of load - about 45k transactions/s on the master and ~36k
transactions/s on the slave actively serving clients. During these busy
times, queries are all fairly responsive (mostly well under 1s) on
both master and slave, and according to pg_stat_replication, replication
is mostly healthy: the flush_location for all slaves seems quite up to
date. But the
replay_location on those busy slaves falls behind by quite a lot (over
an hour behind), and this is a problem. On the slaves which aren't
taking client load, their replay_location remains close to the
flush_location.

Is it plausible that all those queries, which are quick but very
numerous, are what is slowing the replay down? If so, my hope is that we
can simply throw more slaves at the problem, reducing the number of
queries per slave and so letting replay get blocked less often. But if
that theory is nonsense, I'm going to need a different solution.
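
To put a number on "falls behind", here is a rough sketch of how the gap
between flush_location and replay_location could be turned into bytes
from the values pg_stat_replication reports. The function names and the
sample locations are made up for illustration; only the location format
(two hex numbers separated by a slash) is PostgreSQL's. On the server
itself, pg_xlog_location_diff(flush_location, replay_location) should
give the same figure in 9.5.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert an xlog location such as '16/B374D848' to a byte offset.

    The part before the slash is the high 32 bits and the part after it
    is the low 32 bits, both written in hexadecimal.
    """
    high, low = lsn.strip().split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replay_backlog_bytes(flush_location: str, replay_location: str) -> int:
    """Bytes of WAL already flushed on the slave but not yet replayed."""
    return lsn_to_bytes(flush_location) - lsn_to_bytes(replay_location)


if __name__ == "__main__":
    # Hypothetical values, as if copied from one row of pg_stat_replication.
    flush = "2A/4000000"
    replay = "29/F0000000"
    backlog = replay_backlog_bytes(flush, replay)
    print(f"replay backlog: {backlog / (1024 * 1024):.1f} MiB")
```

Sampling this every few seconds on a busy slave and on an idle one would
show whether the backlog grows only while clients are connected, which
is what the theory above predicts.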