Streaming Replication Lag and pg_last_xlog_receive_location() question

Don Seiler <don@xxxxxxxxx> · Mon, 23 Jul 2018 11:07:51 -0500

Good morning listers,
The past couple of weeks I've observed lag to my remote DR standby for no apparent reason. The local HA standby is caught up just fine, and there is no extraordinary write activity on the primary this morning.

Running this query from the primary:

select client_addr as client, usename as user, application_name as appname,
    state, sync_state as mode,
    (pg_xlog_location_diff(pg_current_xlog_location(), sent_location) / 1024)::int as pending,
    (pg_xlog_location_diff(sent_location, write_location) / 1024)::int as write,
    (pg_xlog_location_diff(write_location, flush_location) / 1024)::int as flush,
    (pg_xlog_location_diff(flush_location, replay_location) / 1024)::int as replay,
    (pg_xlog_location_diff(pg_current_xlog_location(), replay_location) / 1024)::int as total_lag
from pg_stat_replication;
    client     |    user    |   appname   |   state   | mode  | pending | write | flush | replay | total_lag
---------------+------------+-------------+-----------+-------+---------+-------+-------+--------+-----------
 10.0.100.99   | replicator | walreceiver | streaming | async |       0 |     0 |     0 |     24 |        24
 10.1.100.99   | replicator | walreceiver | streaming | async |  616433 |   768 |     0 |    128 |    617329

The 10.1 network is the DR site. You can see the "pending" column building up, representing WAL data that hasn't been sent over the network yet. Given that the local replica is doing just fine, my assumption is that it is something wrong with the networking between data centers. My network admin tells me everything is fine and that the circuit is nowhere near capacity. It definitely doesn't seem to be a throughput issue in this case given that there is no extraordinary load on the primary. The lag will wax and wane and be gone for a few hours and then return. Last time it did this for 2-3 days and then went away altogether until this weekend again. This "unexplained lag" just started last week. We've previously had expected lag when running large DELETE operations, but the DB catches up from those and then we'll see this unexpected lag hours or days later.

None of our other streaming replica setups on other DB clusters are having problems, but they are even much more quiet than this one.

I'm curious what tools I might be missing from my brain to properly diagnose this as either a network issue or something else and focus my effort appropriately. Nothing changed in the DB or network stack configurations that I'm aware of in the past two weeks.

One other thing: when I ran location check on the lagging standby, I observed an interesting phenomenon. The value of pg_last_xlog_receive_location() always ends in "6000" when it is lagging like this. When things are fine it is any seemingly random set of characters (confirmed on the local HA replica). 

             now               | receive_location | replay_location
-------------------------------+------------------+-----------------
 2018-07-23 10:49:03.64908-05  | 1679/6F326000    | 1679/6F325FE8

              now              | receive_location | replay_location
-------------------------------+------------------+-----------------
 2018-07-23 10:49:07.331111-05 | 1679/6F3A6000    | 1679/6F3A5FF0

              now              | receive_location | replay_location
-------------------------------+------------------+-----------------
 2018-07-23 10:49:11.706098-05 | 1679/6F446000    | 1679/6F444180

              now              | receive_location | replay_location
-------------------------------+------------------+-----------------
 2018-07-23 10:49:15.562082-05 | 1679/6F4C6000    | 1679/6F4C5FC0

              now              | receive_location | replay_location
-------------------------------+------------------+-----------------
 2018-07-23 10:49:18.394078-05 | 1679/6F546000    | 1679/6F545FF8

              now              | receive_location | replay_location
-------------------------------+------------------+-----------------
 2018-07-23 10:50:28.328077-05 | 1679/6FF06000    | 1679/6FF04680

It was also like this last week when we saw similar "mystery" lag, receive location ending in "6000" every time. Just curious about that one.

Thanks,
Don.

-- 
Don Seiler
www.seiler.us