RE: Slave stuck in recovery mode

"Nicolas Ross" <rossnick-lists@xxxxxxxxxxxx> · Sat, 9 Oct 2021 15:36:35 -0400

I ended up googling some more and found this:

https://www.enterprisedb.com/blog/be-sure-stop-your-backups

That is exactly what was happening. Even though I had no backup running, I ran the stop_backup command, etc.
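For anyone landing on this thread later, here is a minimal sketch of that check, assuming PostgreSQL 9.6, where pg_is_in_backup() and pg_stop_backup() exist (both were removed in later major versions). The PSQL variable and the two function names are only illustrative and are not part of repmgr or PostgreSQL. Connection settings are assumed to come from the usual PG* environment variables.

```shell
# Sketch for PostgreSQL 9.6; meant to be run against the master.
# PSQL is a hypothetical override so the client command can be swapped out.
PSQL="${PSQL:-psql}"

# Prints "t" if an exclusive backup is still marked as in progress, "f" otherwise.
backup_in_progress() {
    "$PSQL" -X -At -c "SELECT pg_is_in_backup();"
}

# Ends a leftover exclusive backup.
# Only call this when you are sure no real backup is running.
stop_leftover_backup() {
    "$PSQL" -X -At -c "SELECT pg_stop_backup();"
}

# Example:
#   [ "$(backup_in_progress)" = "t" ] && stop_leftover_backup
```

Sourcing the block only defines the functions, so nothing touches the server until one of them is called.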

I then scheduled a restart of the master server and re-cloned the slave, and after that everything was OK!

Strange.

-----Original Message-----
From: Nicolas Ross <rossnick-lists@xxxxxxxxxxxx>
Sent: October 8, 2021 19:16
To: pgsql-admin@xxxxxxxxxxxxxxxxxxxx
Subject: Slave stuck in recovery mode

Hi !

We’ve been using postgres for some time now (since the
9.3 days).

I’ve got a 9.6 cluster with 2 nodes, a primary and a
slave. We use repmgr to manage the cluster. When it was
installed, it was something like repmgr 4.x or even 3.

This week, for some reason, I had to rebuild the slave
instance. So I re-cloned the slave using a command like:

/usr/pgsql-9.6/bin/repmgr -h pgserver2.qualite -U repmgr -f
/etc/repmgr/9.6/repmgr.conf standby clone

After some time (it’s about 250 GB, so it takes an hour
or two), the command completes.

If I start the postgres server on the slave with the OS
systemctl script, it doesn’t return to the CLI (presumably
it is waiting for something).

In the log I see:

< 2021-10-08 16:16:47.861 EDT > LOG:  database system was shut down in recovery at 2021-10-08 16:04:10 EDT
< 2021-10-08 16:16:47.877 EDT > LOG:  entering standby mode
< 2021-10-08 16:16:48.599 EDT > LOG:  redo starts at 13BF/CF000028
< 2021-10-08 16:16:52.899 EDT > LOG:  consistent recovery state reached at 13BF/D53BA0F0
(Some time passes)
< 2021-10-08 16:46:10.363 EDT > LOG:  started streaming WAL from primary at 13C9/8C000000 on timeline 1

After that, if I try to connect to the slave, I get:

FATAL:  the database system is starting up

No matter how long I wait (tried more than a day later).
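A small sketch of how the log can be used to tell these states apart. With hot_standby on, a 9.6 standby that has started accepting connections logs "database system is ready to accept read only connections"; in the excerpt above that line never appears, even though "consistent recovery state reached" does. The function name and the log path in the usage comment are assumptions for illustration only.

```shell
# Classify a standby's startup progress from its server log.
# Usage: standby_log_status /path/to/postgresql.log   (path is hypothetical)
standby_log_status() {
    log="$1"
    if grep -q "ready to accept read only connections" "$log"; then
        # Hot standby is open for read-only queries.
        echo "accepting read-only connections"
    elif grep -q "consistent recovery state reached" "$log"; then
        # Consistency was logged, but connections were never opened.
        echo "consistent, but not accepting connections yet"
    else
        echo "still replaying WAL toward a consistent state"
    fi
}
```

The check looks at the whole file, so for a log that spans several restarts it is better to pass only the lines written since the last start.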

During that time, the master still streams WAL to the
slave.
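One way to confirm the streaming from the master's side is the pg_stat_replication view. This is only a sketch, assuming PostgreSQL 9.6, where the columns are named sent_location and replay_location (they were renamed to sent_lsn and replay_lsn in version 10). The PSQL variable and the function name are illustrative.

```shell
# Sketch for PostgreSQL 9.6; meant to be run against the master.
# PSQL is a hypothetical override so the client command can be swapped out.
PSQL="${PSQL:-psql}"

# Prints one line per connected standby:
#   application_name state sent_location replay_location
replication_status() {
    "$PSQL" -X -At -F ' ' -c "SELECT application_name, state, sent_location, replay_location FROM pg_stat_replication;"
}

# Example:
#   replication_status
```

A state of "streaming" only shows that WAL is being sent; it says nothing about whether the standby has opened for connections.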

Notes:

That last log excerpt was taken after trying to clone from
our barman server (I tried both with and without barman).

use_replication_slots is set to yes.

hot_standby is on on the primary, so it is also on on the
slave once cloned.

Before one of my clone attempts, I tried cleaning out all
traces of repmgr (removing the extension, re-registering
the master, etc.), but the issue stayed the same.

If I comment out hot_standby on the slave, it starts
normally, but still doesn’t allow connections.

recovery.conf is:

standby_mode = 'on'
primary_conninfo = 'host=MASTERIP user=repmgr application_name=SLAVENAME'
recovery_target_timeline = 'latest'
primary_slot_name = 'repmgr_slot_1'

Any help troubleshooting this would be appreciated!