On Thu, Oct 4, 2018 at 6:13 AM Anders Wegge Keller <wegge@xxxxxxxx> wrote:
Is there a reliable way to detect, on the standby, that continuous recovery
have stopped for any reason? I'm adding a postgres 9.2(1) database to an
industrial control system that have to run unattended. Thus, I need to plan
for some really exotic failure modes, like e.g. loss of network connectivity
in an indefinitely long period.
On the master system, I can check for the existence of the proper row in
pg_stat_replication. If that is missing, I can report a degraded status on
the whole system. However, on the standby side, I've been unable to find
something similar. Especially the failure mode where the standby return from
a long period of no network connectivity, where the master no longer have
the WAL segments needed to catch up. Snooping for FATAL in
pg_log/Postgres-xxx is error prone, and I'd rather not go the route of
polling the master system for a situation that should be detectable on the
standby instance.
--
//Wegge
1. Yes, I know it's no longer supported, but that's the way of redhat.
Checking replication status from just the replica itself isn't that
straight forward. This makes sense, though, since how is the replica
supposed to know that it's behind if it doesn't compare itself to
something else? PostgreSQL replication works by simply replaying the WAL
stream from another database. When that WAL stream stops, the
replication stops. This could be normal if the primary hasn't gotten any
writes.
I wrote a brief blog post on monitoring replica lag:
There
is a query you can run on the replica, but as I said above and in the
blog post, this can cause false positives when there's no writes on the
primary. But if you're always supposed to be getting writes to the
primary, it's the easiest method to check, and could even help indicate
when there's a problem if your primary isn't getting writes as it should
be. I highly recommend also monitoring your primary for byte lag as well, so you can at least know when the replica is throwing a false positive.
Another option would be, as you
hinted, to watch the replica for expected recent data. But again, if
there's no writes, it can cause false positives. So it's really not much
different than just checking for the last WAL replay.
If
you're worried about the replica falling behind due to network outages
or something similar, I recommend looking into WAL shipping in addition
to streaming replication. If postgres doesn't get a successful run
result from the archive_command, it keeps the WAL file around until it
does. You will have to monitor for disk space usage issues on the
primary if it backs up too much. If this happens a lot, you may be
better off shipping them to a third-party server, preferably where
you're storing your backups. You can then have the replica fetch its
WAL stream from there if streaming replication fails. pgbackrest is one
tool that can do this for you, as well as help manage backups.
https://pgbackrest.org/
If you're able to
upgrade to 9.4, postgres did introduce replication slots to help the
primary actually be aware of the state of its replicas and keep WAL
files around automatically until all its replicas are caught up. While 9.2 is the default version of PG that comes with Redhat/CentOS, the community does provide repositories with more recent rpm versions. https://www.postgresql.org/download/linux/redhat/ I would highly recommend looking into this since Redhat is not going to be updating their default version for a very long time.