Re: Detect when streaming replication stops streaming.

Keith Fiske <keith.fiske@xxxxxxxxxxxxxxx> · Thu, 4 Oct 2018 10:31:27 -0400

On Thu, Oct 4, 2018 at 6:13 AM Anders Wegge Keller <wegge@xxxxxxxx> wrote:
Is there a reliable way to detect, on the standby, that continuous recovery

have stopped for any reason? I'm adding a postgres 9.2(1) database to an

industrial control system that have to run unattended. Thus, I need to plan

for some really exotic failure modes, like e.g. loss of network connectivity

in an indefinitely long period. 

On the master system, I can check for the existence of the proper row in

pg_stat_replication. If that is missing, I can report a degraded status on

the whole system. However, on the standby side, I've been unable to find

something similar. Especially the failure mode where the standby return from

a long period of no network connectivity, where the master no longer have

the WAL segments needed to catch up. Snooping for FATAL in

pg_log/Postgres-xxx is error prone, and I'd rather not go the route of

polling the master system for a situation that should be detectable on the

standby instance. 

-- 

//Wegge

1. Yes, I know it's no longer supported, but that's the way of redhat. 

Checking replication status from just the replica itself isn't that
 straight forward. This makes sense, though, since how is the replica 
supposed to know that it's behind if it doesn't compare itself to 
something else? PostgreSQL replication works by simply replaying the WAL
 stream from another database. When that WAL stream stops, the 
replication stops. This could be normal if the primary hasn't gotten any
 writes.

I wrote a brief blog post on monitoring replica lag:

https://www.keithf4.com/monitoring_streaming_slave_lag/

There
 is a query you can run on the replica, but as I said above and in the 
blog post, this can cause false positives when there's no writes on the 
primary. But if you're always supposed to be getting writes to the 
primary, it's the easiest method to check, and could even help indicate 
when there's a problem if your primary isn't getting writes as it should
 be. I highly recommend also monitoring your primary for byte lag as well, so you can at least know when the replica is throwing a false positive.

Another option would be, as you 
hinted, to watch the replica for expected recent data. But again, if 
there's no writes, it can cause false positives. So it's really not much
 different than just checking for the last WAL replay.

If
 you're worried about the replica falling behind due to network outages 
or something similar, I recommend looking into WAL shipping in addition 
to streaming replication. If postgres doesn't get a successful run 
result from the archive_command, it keeps the WAL file around until it 
does. You will have to monitor for disk space usage issues on the 
primary if it backs up too much. If this happens a lot, you may be 
better off shipping them to a third-party server, preferably where 
you're storing your backups. You can then have the replica fetch its 
WAL stream from there if streaming replication fails. pgbackrest is one 
tool that can do this for you, as well as help manage backups. 
https://pgbackrest.org/

If you're able to 
upgrade to 9.4, postgres did introduce replication slots to help the 
primary actually be aware of the state of its replicas and keep WAL 
files around automatically until all its replicas are caught up. While 9.2 is the default version of PG that comes with Redhat/CentOS, the community does provide repositories with more recent rpm versions. https://www.postgresql.org/download/linux/redhat/  I would highly recommend looking into this since Redhat is not going to be updating their default version for a very long time.

-- 
Keith Fiske
Senior Database Engineer
Crunchy Data - http://crunchydata.com