Re: Replication lag from transaction logs

Keith <keith@xxxxxxxxxxx> · Mon, 18 Jun 2018 14:27:41 -0400

Sorry, I meant pg_controldata, not pg_basebackup.

On Mon, Jun 18, 2018 at 1:57 PM, Keith <keith@xxxxxxxxxxx> wrote:
Yes, in general using pg_basebackup is not ideal for monitoring replica lag. I was just providing a means to do so from an offline instance. For normal monitoring, if you want to track byte lag, you can query it from the primary system, or you can check the lag from the replica itself.

https://www.keithf4.com/monitoring_streaming_slave_lag/
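
As a rough illustration of the byte-lag idea: a WAL position (LSN) is printed as two hexadecimal numbers separated by a slash, the high and low 32 bits of a 64-bit byte offset, so byte lag is the difference between the primary's current position and the replica's replay position. The sketch below does only that arithmetic. The function names are made up for this example, and it assumes you already have the two LSN strings (on PostgreSQL 10 or later they could come from pg_current_wal_lsn() on the primary and replay_lsn in pg_stat_replication, for instance).

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.strip().split("/")
    return (int(high, 16) << 32) | int(low, 16)


def byte_lag(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica still has to replay (never negative)."""
    return max(0, lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn))


if __name__ == "__main__":
    # Made-up sample positions, not taken from a real system.
    print(byte_lag("0/3000060", "0/3000000"))  # 96
```

If you are already connected to the database, the built-in pg_wal_lsn_diff() function does the same subtraction in SQL.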

Keith

On Mon, Jun 18, 2018 at 1:03 PM, Scott Ribe <scott_ribe@xxxxxxxxxxxxxxxx> wrote:

> On Jun 18, 2018, at 9:56 AM, Debraj Manna <subharaj.manna@xxxxxxxxx> wrote:
>
> Thanks Keith, this is useful.
>
> One more query: I need to know whether a slave has fallen too far behind and the WAL it needs is no longer available. I guess I can do the following. Let me know if my understanding is correct.
>       • Run pg_controldata <DATA_DIR> on the slave node that has been down for a long time.
>       • It will output details about the WAL, including the name of the WAL file from which replication will start. (Field to look for in the output: 'Latest checkpoint's REDO WAL file')
>       • Then check whether the file named in 'Latest checkpoint's REDO WAL file' is present in the master's `pg_wal` directory. If it is not, the slave has fallen too far behind and will not be able to recover from WAL.
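
The steps in the quoted list could be sketched roughly as follows. This is only an illustration: the helper names are hypothetical, the pg_controldata output is assumed to have been captured as text already, and the WAL directory path is whatever applies on your master.

```python
import os
from typing import Optional

REDO_FIELD = "Latest checkpoint's REDO WAL file"


def redo_wal_file(controldata_output: str) -> Optional[str]:
    """Return the value of the REDO WAL file field, or None if it is absent."""
    for line in controldata_output.splitlines():
        # Each line looks like "Label:   value"; split on the first colon only.
        label, sep, value = line.partition(":")
        if sep and label.strip() == REDO_FIELD:
            return value.strip()
    return None


def wal_file_present(wal_dir: str, wal_file: str) -> bool:
    """True if the named WAL segment exists in the given directory."""
    return os.path.isfile(os.path.join(wal_dir, wal_file))


if __name__ == "__main__":
    # Made-up excerpt in the shape of pg_controldata output.
    sample = (
        "Database cluster state:               shut down in recovery\n"
        "Latest checkpoint's REDO WAL file:    000000010000000000000003\n"
    )
    print(redo_wal_file(sample))  # 000000010000000000000003
```

Note that on PostgreSQL releases before 10 the directory is named pg_xlog rather than pg_wal.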

You could also try bringing the slave back up and monitoring its log for the error about a needed WAL file not being available. This avoids the race condition between checking that all WAL is available and restarting the slave.
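
A small sketch of the log-watching approach is below. It assumes the message text is PostgreSQL's "requested WAL segment ... has already been removed"; check that wording against the logs of your own server version before relying on it. The function name and the sample log lines are invented for illustration.

```python
import re
from typing import Iterable, List

# Assumed wording of the error; verify it against your own logs.
MISSING_WAL = re.compile(
    r"requested WAL segment ([0-9A-F]{24}) has already been removed"
)


def missing_wal_segments(log_lines: Iterable[str]) -> List[str]:
    """Return WAL segment names reported as already removed, in order seen."""
    found = []
    for line in log_lines:
        match = MISSING_WAL.search(line)
        if match:
            found.append(match.group(1))
    return found


if __name__ == "__main__":
    # Made-up log lines for demonstration only.
    sample_log = [
        "LOG:  started streaming WAL from primary at 0/3000000 on timeline 1",
        "FATAL:  could not receive data from WAL stream: ERROR:  requested "
        "WAL segment 000000010000000000000003 has already been removed",
    ]
    print(missing_wal_segments(sample_log))
```

An empty result after the slave has reconnected and started streaming suggests the needed WAL was still available.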

--
Scott Ribe
scott_ribe@xxxxxxxxxxxxxxxx
https://www.linkedin.com/in/scottribe/