Re: improve wals replay on secondary

Fabio Pardi <f.pardi@xxxxxxxxxxxx> · Mon, 27 May 2019 13:04:51 +0200

If you did not even see this message in your standby logs:

restartpoint starting: xlog

then it means that the checkpoint was never even started.

In that case, I have no clue.

Try to describe step by step how to reproduce the problem, together with
your setup and the version numbers of Postgres and repmgr, and I might be
able to help you further.
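
To make it easier to tell which recovery messages did or did not show up,
something like the sketch below could summarize the standby log. It is only
an illustration: the function name summarize_standby_log and the log path in
the usage comment are my own assumptions, and the patterns are based solely
on the message texts quoted in this thread, so adjust them to your
log_line_prefix if needed.

```python
import re
from collections import Counter

# Message fragments as they appear in the log excerpts quoted in this thread.
MILESTONES = {
    "restored": re.compile(r'LOG:\s+restored log file "([0-9A-F]{24})" from archive'),
    "consistent": re.compile(r"LOG:\s+consistent recovery state reached at"),
    "restartpoint_start": re.compile(r"LOG:\s+restartpoint starting: (\w+)"),
    "restartpoint_done": re.compile(r"LOG:\s+restartpoint complete"),
    "starting_up": re.compile(r"FATAL:\s+the database system is starting up"),
}


def summarize_standby_log(lines):
    """Count recovery milestones and remember the last restored WAL segment.

    Returns (counts, last_segment): counts maps every milestone name to the
    number of matching lines (0 if never seen); last_segment is the name of
    the most recently restored WAL file, or None.
    """
    counts = Counter()
    last_segment = None
    for line in lines:
        for name, pattern in MILESTONES.items():
            match = pattern.search(line)
            if match:
                counts[name] += 1
                if name == "restored":
                    last_segment = match.group(1)
    return {name: counts[name] for name in MILESTONES}, last_segment


# Hypothetical usage (the path is an assumption, use your own log file):
#   with open("/var/lib/pgsql/9.6/data/pg_log/postgresql.log") as fh:
#       counts, last_segment = summarize_standby_log(fh)
#   print(counts, last_segment)
```

If "restored" keeps growing while "consistent" and "restartpoint_start" stay
at 0, that would match what you describe: WAL files are being fetched, but
the standby has not yet reported a consistent state or started a
restartpoint.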

regards,

fabio pardi

On 5/27/19 12:17 PM, Mariel Cherkassky wrote:
> standby_mode = 'on'
> primary_conninfo = 'host=X.X.X.X user=repmgr  connect_timeout=10 '
> recovery_target_timeline = 'latest'
> primary_slot_name = repmgr_slot_1
> restore_command = 'rsync -avzhe ssh
> postgres@x.x.x.x:/var/lib/pgsql/archive/%f /var/lib/pgsql/archive/%f ;
> gunzip < /var/lib/pgsql/archive/%f > %p'
> archive_cleanup_command = '/usr/pgsql-9.6/bin/pg_archivecleanup
> /var/lib/pgsql/archive %r'
> 
> On Mon, 27 May 2019 at 12:29, Fabio Pardi
> <f.pardi@xxxxxxxxxxxx> wrote:
> 
>     Hi Mariel,
> 
>     let's keep the list in cc...
> 
>     settings look ok.
> 
>     what's in the recovery.conf file then?
> 
>     regards,
> 
>     fabio pardi
> 
>     On 5/27/19 11:23 AM, Mariel Cherkassky wrote:
>     > Hey,
>     > the configuration is the same as on the primary:
>     > max_wal_size = 2GB
>     > min_wal_size = 1GB
>     > wal_buffers = 16MB
>     > checkpoint_completion_target = 0.9
>     > checkpoint_timeout = 30min
>     >
>     > Regarding your question, I didn't see this message (consistent recovery
>     > state reached at), I guess that's why the secondary isn't available yet.
>     >
>     > Maybe I'm wrong, but what I understood from the documentation is that a
>     > restart point is generated only after the secondary has had a checkpoint,
>     > which means only after 30 minutes or after max_wal_size is reached? But
>     > still, why won't the secondary reach a consistent recovery state (does it
>     > require a restart point to be generated)?
>     >
>     >
>     > On Mon, 27 May 2019 at 12:12, Fabio Pardi
>     > <f.pardi@xxxxxxxxxxxx> wrote:
>     >
>     >     Hi Mariel,
>     >
>     >     if I'm not wrong, on the secondary you will see the messages you
>     >     mentioned when a checkpoint happens.
>     >
>     >     What are checkpoint_timeout and max_wal_size on your standby?
>     >
>     >     Did you ever see this on your standby log?
>     >
>     >     "consistent recovery state reached at .."
>     >
>     >
>     >     Maybe you can post the whole configuration of your standby for
>     >     easier debugging.
>     >
>     >     regards,
>     >
>     >     fabio pardi
>     >
>     >
>     >
>     >
>     >     On 5/27/19 10:49 AM, Mariel Cherkassky wrote:
>     >     > Hey,
>     >     > PG 9.6, I have a standalone configured. I tried to start up a
>     >     > secondary by running standby clone (repmgr). The clone process took
>     >     > 3 hours, and during that time WALs were generated (mostly because of
>     >     > the checkpoint_timeout). As a result, when I start the secondary, I
>     >     > see that it keeps fetching the WALs, but I don't see any messages
>     >     > indicating that the secondary tried to replay them.
>     >     > Messages that I see:
>     >     > receiving incremental file list
>     >     > 000000010000377B000000DE
>     >     >
>     >     > sent 30 bytes  received 4.11M bytes  8.22M bytes/sec
>     >     > total size is 4.15M  speedup is 1.01
>     >     > 2019-05-22 12:48:10 EEST  60942  LOG:  restored log file "000000010000377B000000DE" from archive
>     >     > 2019-05-22 12:48:11 EEST db63311  FATAL:  the database system is starting up
>     >     > 2019-05-22 12:48:12 EEST db63313  FATAL:  the database system is starting up
>     >     >
>     >     > I was hoping to see the following messages (taken from a different
>     >     > machine):
>     >     > 2019-05-27 01:15:37 EDT  7428  LOG:  restartpoint starting: time
>     >     > 2019-05-27 01:16:18 EDT  7428  LOG:  restartpoint complete: wrote 406 buffers (0.2%); 1 transaction log file(s) added, 0 removed, 0 recycled; write=41.390 s, sync=0.001 s, total=41.582 s; sync files=128, longest=0.000 s, average=0.000 s; distance=2005 kB, estimate=2699 kB
>     >     > 2019-05-27 01:16:18 EDT  7428  LOG:  recovery restart point at 4/D096C4F8
>     >     >
>     >     > My primary settings (WAL settings):
>     >     > wal_buffers = 16MB
>     >     > checkpoint_completion_target = 0.9
>     >     > checkpoint_timeout = 30min
>     >     >
>     >     > Any idea what could explain why the secondary doesn't replay the
>     >     > WALs?
>     >
>     >
>