Hi,
I will do the test tomorrow morning with your suggestions and will capture the logs and steps.
Thanks for the advice.
regards,
Marian
Op ma 11 mrt. 2019 om 15:55 schreef Jehan-Guillaume (ioguix) de Rorthais <ioguix@xxxxxxx>:
On Mon, 11 Mar 2019 12:17:59 +0100
M Jane <mskiesewetter@xxxxxxxxx> wrote:
> Hi,
> Thanks for your reply
> I will try to put in all the logging I collected when doing the switchover.
> I did the following:
> 1. Check if the standby is uptodate
> I ran the following query on the standby:
> SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(),
> pg_last_xlog_replay_location()) AS replication_delay_bytes;
> This resulted in 0.
This is not required. You don't need to check if some received XLOG are not
replayed on data yet. As far as the XLOG is on disk on the standby side, it will
be replayed before promoting by default. Quoting doc:
pg_last_xlog_receive_location():
Get last transaction log location received and synced to disk by streaming
replication.
In this step, you should check the standby is connected to the primary. Either
be listing your processes on both side, check pg_stat_replication, etc.
> 2. I shutdown the primary instance
> and checked the alertlog of the primary:
[...]
> < 2019-03-08 09:46:41.159 CET >LOG: received smart shutdown request
> < 2019-03-08 09:46:41.160 CET >LOG: autovacuum launcher shutting down
> < 2019-03-08 09:46:41.160 CET >FATAL: terminating connection due to
> administrator command
What is this FATAL? Did you issue some more administrative commands after the
initial "pg_ctl -m smart stop"?
I would recommend using "pg_ctl -m fast stop" and nothing else.
> 3. Then i promoted the slave with pg_ctl promote and checked the alertlog:
What was the last received LSN? What was the last WAL file the standby
retrieved from archives?
Do you have log lines before "LOG: selected new timeline ID: 4" ?
[...]
> 4. On the former primary I created a recovery.conf
> cat recovery.conf
> restore_command = 'cat archive_dir/%f | gunzip > %p'
"cat" is useless here. I recommend: "gzip -dc archive_dir/%f > %p"
> recovery_target_timeline = 'latest'
> standby_mode = on
> primary_conninfo = 'host=newprimary port=5432 user=replication password=xxxx'
> trigger_file = '/tmp/instance.trigger.5432'
You should not put the trigger file in a public accessible folder. Anyone can
triggers the promotion...
You should only rely on "pg_ctl promote" if possible.
> 5. And started te instance on the former primary
> alertlog former primary:
[...]
> < 2019-03-08 09:53:04.591 CET >FATAL: requested timeline 4 is not a child
> of this server's history
> < 2019-03-08 09:53:04.591 CET >DETAIL: Latest checkpoint is at 2/A8000028
> on timeline 3, but in the history of the requested timeline, the server
> forked off from that timeline at 2/A70000D0.
According to these log lines:
* during shutdown, old primary archived 0000000300000002000000A7 and wrote its
shutdown checkpoint as first xlog record in 0000000300000002000000A8. The
following command should show it:
pg_xlogdump "pg_xlog/0000000300000002000000A8"
* for some reasons, the standby never received any content from
0000000300000002000000A8. Neither be streaming rep, nor restored from archives
* the promoted standby forked its timeline from the last received LSN that was
in early 0000000300000002000000A7 file (2/A70000D0)
* you end up with a new master that forked (probably) right before the shutdown
checkpoint of the old master (as far as the next record in 2/A70000D0 was an
XLOG SWITCH).
In conclusion, the old master is right: the fork happen before its own last
writes.
> Am I missing something?
You should really take care of the first steps. Why the standby did not
received the shutdown checkpoint? Was it connected when you issued the "pg_ctl
stop"? Did the wal_sender processus been killed? A network issue? Why the
standby has not been able to restore the last segment from archives before
promoting?
I usually compare the last checkpoint LSN in the controldata from the old
master with what received the standby to promote, before promoting it.
> Please advise?
> Thanks in advance.
> regards,
> Marian
>
> Op vr 8 mrt. 2019 om 17:51 schreef Jehan-Guillaume (ioguix) de Rorthais <
> ioguix@xxxxxxx>:
>
> > On Fri, 8 Mar 2019 22:10:15 +0530
> > Shreeyansh Dba <shreeyansh2014@xxxxxxxxx> wrote:
> >
> > > Hi Marian,
> > >
> > > Yes, We have faced the same issue, generally, this type of issue occurs
> > due
> > > to missing of history file i.e(00000000.history) during the replication
> > > setup using the pg_basebackup.
> > >
> > > To resolve this type issue you need to create an empty history file with
> > > name 00000000.history on the master server then rebuild old master by
> > > initiating the pg_basebackup.
> >
> > Uh, what? Do you have further info about this? I fail to understand how
> > this is
> > related to OP question...
> >
> > > On Fri, Mar 8, 2019 at 8:11 PM M Jane <mskiesewetter@xxxxxxxxx> wrote:
> > ...
> > > > But bringing up the old primary as a slave doesn't work. I get errors (
> > > > like >FATAL: requested timeline 5 is not a child of this server's
> > history)
> > > > and the instances shuts down automatically.
> >
> > While doing a clean switchover, I never seen this (yet).
> >
> > How are you checking both servers are up-to-date? Do you check that the
> > standby
> > received everything from the old primary, including its shutdown
> > checkpoint,
> > before promoting?
> >
> > Make sure your standby staid connected to your old primary until the later
> > is
> > completely shutdown-ed. Nothing should disconnect them during the whole
> > process. And make sure to use mode fast or smart to stop the old primary.
> >
> > Are you able to share with us the restart checkpoint on the old
> > primary and the history file created by the new-primary? How far are the
> > restart LSN and timeline fork from each other?
> >
> > Regards,
> >
--
Jehan-Guillaume de Rorthais
Dalibo