streaming slaves can't keep up?

Ben Chobot <bench@xxxxxxxxxxxxxxx> · Tue, 31 Mar 2020 10:52:27 -0700

We have a few busy 9.5 dbs, each streaming to a few slaves. The master
and slaves are identical hardware and are under considerable load:
about 45k transactions/s on the master and ~36k transactions/s on the
slave actively serving clients. During these busy times, queries are
all fairly responsive (mostly well under 1s) on both master and slave,
and according to pg_stat_replication, replication is mostly good: the
flush_location for all slaves seems quite up to date. But the
replay_location on those busy slaves falls far behind (over an hour),
and this is a problem. On the slaves which aren't taking client load,
replay_location remains close to flush_location.
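
For anyone wanting to reproduce the measurement, here is a minimal
sketch of how the gap between flush_location and replay_location could
be expressed in bytes. The helper names (parse_lsn, replay_lag_bytes)
are hypothetical and the sample LSN values are made up for
illustration; the query in LAG_QUERY assumes the 9.5 column and
function names and is shown for reference only, not executed.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a byte position.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replay_lag_bytes(flush_location: str, replay_location: str) -> int:
    """Bytes of WAL already flushed on a standby but not yet replayed."""
    return parse_lsn(flush_location) - parse_lsn(replay_location)


# Server-side equivalent on 9.5, run on the master (reference only).
LAG_QUERY = """
SELECT application_name, client_addr,
       pg_xlog_location_diff(flush_location, replay_location)
           AS replay_lag_bytes
FROM pg_stat_replication;
"""

if __name__ == "__main__":
    # Made-up sample values, not taken from the systems described above.
    print(replay_lag_bytes("16/B374D848", "16/A0000000"))
```

A byte count says how much WAL is waiting to be replayed, not how long
it will take; for a time-based figure, pg_last_xact_replay_timestamp()
on the slave can be compared against now().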

Does it make sense that all those queries, which are quick but quite
numerous, are slowing down the replay? If so, my hope is that we can
simply throw more slaves at the problem, reducing the number of queries
per slave and therefore allowing replay to not get blocked as often.
But if that theory is nonsense, I'm going to need a different solution.