Hi,

I have a two-node cluster: a master and a hot standby. The machines are geographically separated, and the nature of my application is that many read-only queries can tolerate being "behind the times" by a few seconds. So machines near the hot standby connect to the hot standby for these delay-tolerant queries, in order to reduce traffic over the relatively slow link between the two sites.

I have a monitoring script that measures the actual delay for a transaction on the master to appear on the hot standby. Every few minutes, the script runs an update on the master and then sits in a loop checking how long it takes for that change to appear on the hot standby. 99% of the time it's less than a second. But every once in a while the time spikes dramatically, to hundreds or even thousands of seconds, and that's too long... the delay-tolerant queries are not *that* delay-tolerant, so we switch to sending them all to the master. See the graph: http://ibin.co/1rdm4ekiWmpM

I've tried to figure out what causes this, and the only events I can find that correlate are a pg_dump on the master and possibly some autovacuum jobs kicking off.

So my questions:

1) Can a long-running transaction on the master block subsequent transactions from being replayed on the hot standby, or am I totally out to lunch?

2) If (1) is correct, is it still true in 9.4?

3) If (1) is false, does anyone have a plausible explanation for what I'm seeing? I don't think it's the link between the sites, because we also monitor that and it seems to be fine.

Regards,
David.
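
P.S. In case it helps, the probe works roughly along the lines of the sketch below (written here in Python with psycopg2; the DSNs, the lag_probe table, and its columns are placeholders rather than my real schema): stamp a token on the master, then poll the standby until the same token is visible.

import time
import psycopg2

# Placeholders -- the real connection strings, table and column names differ.
MASTER_DSN = "host=master.example.com dbname=mydb user=monitor"
STANDBY_DSN = "host=standby.example.com dbname=mydb user=monitor"

def measure_replay_lag(timeout=3600.0, poll_interval=0.1):
    """Write a unique token on the master, then poll the hot standby until
    the token becomes visible.  Returns the elapsed time in seconds, or
    None if it never shows up within `timeout`."""
    token = str(time.time())

    master = psycopg2.connect(MASTER_DSN)
    master.autocommit = True
    try:
        with master.cursor() as cur:
            # Single-row heartbeat table; the row with id = 1 always exists.
            cur.execute(
                "UPDATE lag_probe SET token = %s, updated_at = now() WHERE id = 1",
                (token,),
            )
    finally:
        master.close()

    start = time.monotonic()
    standby = psycopg2.connect(STANDBY_DSN)
    standby.autocommit = True  # keep each poll a short, standalone snapshot
    try:
        while time.monotonic() - start < timeout:
            with standby.cursor() as cur:
                cur.execute("SELECT token FROM lag_probe WHERE id = 1")
                row = cur.fetchone()
            if row and row[0] == token:
                return time.monotonic() - start
            time.sleep(poll_interval)
    finally:
        standby.close()
    return None

if __name__ == "__main__":
    lag = measure_replay_lag()
    print("replay lag: %.2fs" % lag if lag is not None else "timed out")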