Re: Very long standby database startup after doing pg_basebackup

Laurenz Albe <laurenz.albe@xxxxxxxxxxx> · Fri, 23 Mar 2018 10:42:47 +0100

Mario Mahovlić wrote:
> Thank you for the reply, we will try using perf to see what exactly is going on.
> As for the log file all we get from it (even when log level is set to debug5) is this:
> 
> > >>> Starting postgres...
> > LOG:  database system was interrupted; last known up at 2018-03-20 08:25:18 UTC
> > FATAL:  the database system is starting up
> > psql: FATAL:  the database system is starting up
> > >>> Db replication is still not accessible on *NODE* (trying 17255 times)
> > ..... * This goes on for 6 hours * 
> > LOG:  entering standby mode
> > LOG:  redo starts at 4B/A000028
> > LOG:  consistent recovery state reached at 4B/A000130
> > LOG:  database system is ready to accept read only connections
> > LOG:  started streaming WAL from primary at 4B/B000000 on timeline 35
> > DEBUG:  CommitTransaction
> > DEBUG:  name: unnamed; blockState:       STARTED; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children: 
> > DEBUG:  StartTransactionCommand
> > DEBUG:  StartTransaction
> > ...
> 
> Lines with  >>> are from our own scripts we use for startup and database management.
> 
> So basically on startup all we get in logs is:
> LOG:  database system was interrupted; last known up at 2018-03-20 08:25:18 UTC 
> FATAL:  the database system is starting up  -> every time we query db until it starts up successfully
> 
> Is there any other way to log more detail during postgres startup?
> 
> As i said in initial mail we use  -X stream option when doing pg_basebackup, here is exact pg_basebackup command that is being run:
> 
> > INFO: executing: '/usr/lib/postgresql/9.6/bin/pg_basebackup -l "repmgr base backup"  -D /var/lib/postgresql/data -h *NODE_IP* -p *PORT* -U *USER* -c fast -X stream '
> 
> So yes, we do copy the WAL files, and there is at most 30GB of them (that is how much we keep), however even if database had to restore all of them 6 hours seems a bit much.

Sorry, I missed the pg_basebackup command.

The strange thing is that you see the message "entering standby mode" only six
hours after startup.  This message is written when recovery *starts*, so
(as expected) it is not that recovery takes long.

By the way, you should add %m to your log_line_prefix so we see *when* messages
are logged.

>From the messages you show, it seems like the six hours are spent in StartupXLOG().

The only potentially expensive thing there is SyncDataDirectory() which forces
the data directory to disk. Ok, but 6 hours?

I think it would be enlightening if you could strace the startup process and
see what it does during these 6 hours.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com