Re: Very long standby database startup after doing pg_basebackup

Mario Mahovlić <mariomahovlic@xxxxxxxxx> · Fri, 23 Mar 2018 10:07:11 +0100

Hello Laurenz,
Thank you for the reply, we will try using perf to see what exactly is going on.
As for the log file all we get from it (even when log level is set to debug5) is this:

>>> Starting postgres...
LOG:  database system was interrupted; last known up at 2018-03-20 08:25:18 UTC
FATAL:  the database system is starting up
psql: FATAL:  the database system is starting up
>>> Db replication is still not accessible on *NODE* (trying 17255 times)
..... * This goes on for 6 hours * 
LOG:  entering standby mode
LOG:  redo starts at 4B/A000028
LOG:  consistent recovery state reached at 4B/A000130
LOG:  database system is ready to accept read only connections
LOG:  started streaming WAL from primary at 4B/B000000 on timeline 35
DEBUG:  CommitTransaction
DEBUG:  name: unnamed; blockState:       STARTED; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children: 
DEBUG:  StartTransactionCommand
DEBUG:  StartTransaction
...

Lines with 

>>> are from our own scripts we use for startup and database management.

So basically on startup all we get in logs is:

LOG:  database system was interrupted; last known up at 2018-03-20 08:25:18 UTC

FATAL:  the database system is starting up  -> every time we query db until it starts up successfully

Is there any other way to log more detail during postgres startup?

As i said in initial mail we use 

-X stream option when doing pg_basebackup, here is exact pg_basebackup command that is being run:

INFO: executing: '/usr/lib/postgresql/9.6/bin/pg_basebackup -l "repmgr base backup"  -D /var/lib/postgresql/data -h *NODE_IP* -p *PORT* -U *USER* -c fast -X stream '

So yes, we do copy the WAL files, and there is at most 30GB of them (that is how much we keep), however even if database had to restore all of them 6 hours seems a bit much.

Thank you for the help, kind regards.

On 22 March 2018 at 08:55, Laurenz Albe <laurenz.albe@xxxxxxxxxxx> wrote:
Mario Mahovlić wrote:

> However, our issue is not with primary cloning, but with database startup after

> clone/basebackup is done.

> Startup of standby database is successful but it takes around 6 hours, during

> which time all we get from DB is: FATAL:  the database system is starting up.

> We have noticed that this startup process uses only 1 cpu core on our machine

> which has 8.

>

> Our database size is around 570GB, and we are using schema per tenant data

> distribution, so we have separate schemas for each of the users of our

> application, as it suits our needs well.

> Because of this our database has about 40000 schemas with 5 tables per each

> schema.

> We suspect the reason behind this long startup is amount of data and/or large

> number of schemas/tables on database, however we would like to confirm that somehow.

>

> So our questions are following:

> What could be possible reasons for startup as long as this?

> How can we figure out what exactly does postgres startup process do all this time,

> is there any documentation on what exactly does postgres database do on startup as

> we have not found any info on this?

> Is there any way to speed up this startup process?

> Can we mitigate this startup time in cases when we do planned switchover for

> maintenance by using pg_rewind instead of pulling all data from primary again?

> So if we shut down the database gracefully and use pg_rewind will startup process

> be faster?

You could use "perf top" to see where the time is spent or have

a look at the PostgeSQL log file.

pg_basebackup creates a checkpoint when it starts, so checkpoints that happen

too rarely cannot be the problem.

Did you use "pg_basebackup -X stream" to copy the WAL files along with the data?

If yes, is there a lot of them?

If no, how do you restore the required WAL files?

Yours,

Laurenz Albe

--

Cybertec | https://www.cybertec-postgresql.com