Search Postgresql Archives

How to recover correctly master and replica using backup made by pg_basebackup?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello.

I try to use pg_basebackup to make backup and recover master and replica from it. I recover master with action "promote". Next, I recover replica from the same backup with action "shutdown". After it, I start replica with configured connection and replication slot to the master. I get error, because replica reads and applies
all WAL of irrelevant timeline, which contains logs after restore point.

I wrote Perl script for pg_basebackup's TAP tests. It is attached as file "050_check_replication.pl". It works on PostgreSQL 17, commit 7d85d87f4d5c35fd5b2d38adaef63dfbfa542ccc after applying small patch, which adds possibility to start node with shutdown option.
Patch is attached as file "add_node_start_with_shutdown.diff".

Script generates situation when timeline changes several times. Small schema may
explain it:

Timeline 1:
  Create the first backup
  Create restore point 'rp_repl_bug'
Timeline 1 -> 2:
  Recovery with promote using the first backup and 'rp_repl_bug'
  Create the second backup
  Create restore point 'rp_repl_bug2'
Timeline 2 -> 3:
  Recovery with promote using the second backup and 'rp_repl_bug2'
Timeline 1 -> 4:
  Recovery with promote using the first backup and 'rp_repl_bug'

After it I get "00000004.history" file, which looks strange:
1    0/3000090    at restore point "rp_repl_bug"
2    0/5000090    at restore point "rp_repl_bug2"
3    0/3000090    at restore point "rp_repl_bug"

Next, I try to restore replica from the first backup. Replica did not find WAL for
timeline 4 and downloaded and applied WAL for timeline 2.

cp: cannot stat '${ARCHIVE_PATH}/000000040000000000000003': No such file or directory
LOG:  redo starts at 0/2000028
cp: cannot stat '${ARCHIVE_PATH}/000000040000000000000003': No such file or directory '${ARCHIVE_PATH}/000000020000000000000003' -> '${REPLICA_PGDATA}/pg_wal/RECOVERYXLOG'
LOG:  restored log file "000000020000000000000003" from archive
LOG:  completed backup recovery with redo LSN 0/2000028 and end LSN 0/2000120
LOG:  consistent recovery state reached at 0/2000120
LOG:  recovery stopping at restore point "rp_repl_bug", time 2024-11-05 22:00:16.037126+03

Than, at the next replica's start (with configured connection to the master and replication slot) replica can not start receiving WAL from master, because it applied all WAL from timeline 2 including all lines after LSN, where it should
switch timeline:

LOG:  entering standby mode
LOG:  redo starts at 0/2000028
LOG:  consistent recovery state reached at 0/3000090
LOG:  database system is ready to accept read-only connections
LOG:  started streaming WAL from primary at 0/4000000 on timeline 4
FATAL:  could not receive data from WAL stream: ERROR:  requested starting point 0/4000000 is ahead of the WAL flush position of this server 0/3002000

If I restart replica, it will not start:
LOG:  database system was shut down in recovery at 2024-11-05 22:00:18 MSK
LOG:  entering standby mode
FATAL:  requested timeline 4 is not a child of this server's history
DETAIL:  Latest checkpoint is at 0/3000090 on timeline 2, but in the history of the requested timeline, the server forked off from that timeline at 0/5000090.
LOG:  startup process (PID 2384880) exited with exit code 1

Questions:

1) Why does mirror not switch to the next timeline when it get LSN in WAL, which
there is in the history file? And why does mirror apply all WAL?
2) Why does 00000004.history contain info about switching from 2 and 3 timelines
even if the 4 timeline was switched from the first timeline?

Thank you in advance.
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 007571e948..e18eb3aace 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1112,7 +1112,7 @@ instead return a true or false value to indicate success or failure.
 
 =cut
 
-sub start
+sub _start
 {
 	my ($self, %params) = @_;
 	my $port = $self->port;
@@ -1150,11 +1150,23 @@ sub start
 		BAIL_OUT("pg_ctl start failed") unless $params{fail_ok};
 		return 0;
 	}
+}
 
+sub start
+{
+	my ($self) = @_;
+	$self->_start;
 	$self->_update_pid(1);
 	return 1;
 }
 
+sub start_with_enabled_shutdown
+{
+	my ($self) = @_;
+	$self->_start;
+	return 1;
+}
+
 =pod
 
 =item $node->kill9()

Attachment: 050_check_replication.pl
Description: Perl program


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]

  Powered by Linux