On Mon, May 14, 2018 at 1:31 PM, Johannes Truschnigg <johannes@xxxxxxxxxxxxxxx> wrote:
Do you happen to have historical host-monitoring data available for when the
replication interruption happened? You should definitely check for CPU (on
both sides) and I/O (on the receiver/secondary) saturation.
We do have Grafana and Zenoss info going way back; I'll see if I can get a login there.
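In the meantime, if sysstat is collecting on those hosts, I should be able to
pull a quick spot check of CPU and disk saturation around the time of the
interruption with sar. (The archive path and time window below are only
examples; on Debian-based systems the archives live under /var/log/sysstat
instead of /var/log/sa.)

    # CPU utilization for a window on the day in question (watch %iowait / %idle)
    sar -u -s 09:00:00 -e 10:00:00 -f /var/log/sa/sa14
    # Per-device I/O for the same window (watch await and %util)
    sar -d -p -s 09:00:00 -e 10:00:00 -f /var/log/sa/sa14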
I remember when we first set up streaming replication, back then under
Postgres 9.0: the replication connection defaulted to using TLS/SSL, at the
time with SSL/TLS compression enabled. The huge extra work this incurred on
the CPUs involved regularly made the WAL sender on the primary break
streaming replication, because it couldn't possibly keep up with the data
being pushed into its encrypted & compressed TCP connection over a 10G
link. (Linux's excellent perf tool proved invaluable in determining the
exact cause of the high CPU load inside the postgres processes; once we had
re-compiled OpenSSL without compression, the problem went away.)
Now of course modern TLS library versions don't implement compression any
more, and the ciphers involved are most probably hardware-accelerated for
your combination of hardware and software, but the lesson we learned back
then may still be worth keeping in mind...
Very interesting read. I just re-examined all of our settings in postgresql.conf, pg_hba.conf and recovery.conf, and we don't have SSL enabled anywhere there. I'm going to assume that this isn't a bottleneck in our case, then.
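If it helps, and assuming we're on 9.5 or newer, I believe I can also confirm
on the primary that the live walsender connection isn't using SSL by joining
pg_stat_replication with pg_stat_ssl (run as a superuser; the standby shows up
under whatever application_name it uses):

    psql -x -c "SELECT r.application_name, s.ssl, s.version, s.cipher, s.compression
                FROM pg_stat_replication r
                JOIN pg_stat_ssl s ON s.pid = r.pid;"

A value of f in the ssl column would confirm the walsender is running over a
plain TCP connection.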
Other than that... have you verified that the network link between your hosts
can actually live up to your and your manager's expectations in terms of
bandwidth delivered? iperf3 could help verify that; if the measured bandwidth
for a single TCP stream lives up to what you'd expect, you can probably rule
out network-related concerns and concentrate on other potential bottlenecks.
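Good call. I'll run a single-stream iperf3 test between the two hosts along
these lines (the hostname below is just a placeholder), once in each direction
for comparison:

    # on the standby:
    iperf3 -s
    # on the primary: single TCP stream for 30 seconds
    iperf3 -c standby.example.com -P 1 -t 30
    # same, but with the standby sending (reverse mode)
    iperf3 -c standby.example.com -P 1 -t 30 -R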
Don.
Don Seiler
www.seiler.us