Re: Detect when streaming replication stops streaming.

Anders Wegge Keller <wegge@xxxxxxxx> · Thu, 4 Oct 2018 23:06:30 +0200

On Thu, 4 Oct 2018 10:31:27 -0400
Keith Fiske <keith.fiske@xxxxxxxxxxxxxxx> wrote:

> Checking replication status from just the replica itself isn't that
> straight forward. This makes sense, though, since how is the replica
> supposed to know that it's behind if it doesn't compare itself to something
> else? PostgreSQL replication works by simply replaying the WAL stream from
> another database. When that WAL stream stops, the replication stops. This
> could be normal if the primary hasn't gotten any writes.
> 
> I wrote a brief blog post on monitoring replica lag:
> 
> https://www.keithf4.com/monitoring_streaming_slave_lag/
> 
> There is a query you can run on the replica, but as I said above and in the
> blog post, this can cause false positives when there's no writes on the
> primary. But if you're always supposed to be getting writes to the primary,
> it's the easiest method to check, and could even help indicate when there's
> a problem if your primary isn't getting writes as it should be. I highly
> recommend also monitoring your primary for byte lag as well, so you can at
> least know when the replica is throwing a false positive.
> 
> Another option would be, as you hinted, to watch the replica for expected
> recent data. But again, if there's no writes, it can cause false positives.
> So it's really not much different than just checking for the last WAL
> replay.

 I can see that I'm really overthinking things. Adding a table on the master
with a timestamp that's updated on a regular basis is enough to cover the
cases where I want the system to detect a lack of replication. Hat tip to
Fernando Souza, who made this suggestion in a mail that ended in my inbox,
rather than this list. And thanks for the blog post. 

> If you're worried about the replica falling behind due to network outages
> or something similar, I recommend looking into WAL shipping in addition to
> streaming replication. 

 If the systems are offline for long enough that the 10 WAL segments it's
configured to keep presently have rolled over, they're about three days out
of touch, and in a situation where it's a lighter network load to restart
from a fresh backup. 

> If postgres doesn't get a successful run result from
> the archive_command, it keeps the WAL file around until it does. You will
> have to monitor for disk space usage issues on the primary if it backs up
> too much. 

 The system we're replacing here was originally deployed on IBM 330 servers
in 1999. With modern server hardware, we're so ridiculously overprovisioned,
that we could keep at least a year of WAL around, before reaching 50% disk
use. So resource usage is luckily not one of the things I have to worry
about :)

...

> If you're able to upgrade to 9.4, postgres did introduce replication slots
> to help the primary actually be aware of the state of its replicas and keep
> WAL files around automatically until all its replicas are caught up. While
> 9.2 is the default version of PG that comes with Redhat/CentOS, the
> community does provide repositories with more recent rpm versions.
> https://www.postgresql.org/download/linux/redhat/  I would highly recommend
> looking into this since Redhat is not going to be updating their default
> version for a very long time.

 I'll take a look at 9.4 for when we make the next iteration of our "Base"
OS Media. With the present project, however, I'm stuck with what's in RHEL
7.5 And even going there was a sort of emergency deal because of hardware
incompatibilities. 

Thank you very much for taking your time to answer my question. I can see I
still have much to learn, when thinking about databases in this context. And
my apologies if I sound hostile towards your suggestions above. I value
them, but I also want to explain why some of them are impossible to follow.

-- 

//Wegge