Re: Cannot rebuild a standby server

Well, I did finally get it working by adding -X s -c fast to the pg_basebackup command.
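
A minimal sketch of what the full command might look like with those two options added. The host name is a placeholder (the thread only shows <primary server>), and the function prints the command rather than running it, so it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical values; substitute the real primary host and data directory.
PRIMARY_HOST="primary.example.com"
DATA_DIR="/var/lib/pgsql/9.3/data"

# Print the pg_basebackup invocation instead of running it (dry run).
#   -X stream (short form: -X s) streams the WAL generated during the
#             backup alongside the data files, so the copy carries the
#             WAL it needs to reach a consistent state.
#   -c fast   requests an immediate checkpoint instead of a spread one,
#             so the backup starts without a long wait.
basebackup_cmd() {
    printf 'pg_basebackup -h %s -D %s -X stream -c fast\n' "$1" "$2"
}

basebackup_cmd "$PRIMARY_HOST" "$DATA_DIR"
```

Note that -X stream opens a second replication connection, so max_wal_senders on the primary has to allow at least two.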

Kevin, if I didn't copy the WALs over, the database still refused to start, claiming it was looking for one specific file, one of the first. Also, I've not seen any references saying that removing files such as backup_label from the standby's data directory causes problems. The other files I removed were the old postgresql.pid file from the primary and a file called archiving_active, which I use to control whether PostgreSQL writes WAL files or not. It seems a little funny to me that I've followed this same procedure for over 4 months with no problems, and today was the first time it bit me.

On 6/20/2014 2:09 PM, Kevin Grittner wrote:
> John Scalia <jayknowsunix@xxxxxxxxx> wrote:
>
>> In the true definition of insanity, I've tried to rebuild a standby
>> streaming replication server using the following steps several times:
>>
>> 1) ensure the postgresql data directory, /var/lib/pgsql/9.3/data, is empty.
>> 2) run: pg_basebackup -h <primary server> -D /var/lib/pgsql/9.3/data
>> 3) manually copy the WALs from the primary server's pg_xlog directory
>> to the directory specified in the standby's recovery.conf restore_command.
>
> Step 3 is enough to cause database corruption on the replica.
>
>> 4) rm any artifacts from the standby's new data directory like the
>> backup_label file.
>
> So is that.
>
>> 5) copy the saved recovery.conf into the standby's data directory and check
>> it is accurate.
>> 6) Start the database using "service postgresql-9.3 start"
>>
>> Every time, however, the following appears in the pg_log/postgresql-Fri.log:
>> <timestamp> LOG: entering standby mode
>> <timestamp> LOG: restored log file "00000003.history"
>> <timestamp> LOG: invalid secondary checkpoint record
>> <timestamp> PANIC: could not locate a valid checkpoint record
>
> Yep, that's about the best result you can expect with the above
> procedure; it is also occasionally possible to get it to start, but
> if it did there would almost certainly be data loss or corruption.
>
>> All this was originally caused by testing the failover mechanism in
>> pgpool. That didn't succeed and I'm trying to get the servers back to
>> their original states. I've done this kind of thing before, but don't
>> know what's wrong with this effort. What have I missed?
>
> You should enable WAL archiving and the restore_command in
> recovery.conf should copy WAL files from the archive.  The pg_xlog
> directory should be empty when starting recovery unless the primary
> is stopped and you only copy pg_xlog files from the stopped server
> into the pg_xlog directory of the recovery cluster.  Don't delete
> the backup_label file, because it has the information recovery
> needs about the point from which it should start WAL replay --
> without it, it will have to guess, and is very likely to get that
> wrong.
>
> The documentation is your friend.  It gives pretty specific
> instructions for what to do.
>
> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
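
A rough sketch of how Kevin's two points (restore_command reading from a WAL archive, and leaving backup_label in place) might be put into a script. The archive path, host name, and both function names are assumptions made for illustration; none of them come from the thread. The demonstration runs against a scratch directory, not a real data directory.

```shell
#!/bin/sh
# Assumed settings on the primary (postgresql.conf), shown for context only:
#   wal_level = hot_standby
#   archive_mode = on
#   archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'

# Write a minimal recovery.conf for a 9.3 streaming standby.
# Arguments: data directory, primary host, WAL archive directory.
write_recovery_conf() {
    data_dir=$1; primary_host=$2; archive_dir=$3
    cat > "$data_dir/recovery.conf" <<EOF
standby_mode = 'on'
primary_conninfo = 'host=$primary_host'
restore_command = 'cp $archive_dir/%f %p'
EOF
}

# Stop early if backup_label is missing: recovery uses it to find the
# point where WAL replay must start.
check_backup_label() {
    if [ ! -f "$1/backup_label" ]; then
        echo "backup_label missing in $1; take a new base backup" >&2
        return 1
    fi
}

# Demonstration against a scratch directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/backup_label"   # stands in for the file pg_basebackup creates
write_recovery_conf "$demo_dir" primary.example.com /mnt/wal_archive
check_backup_label "$demo_dir" && cat "$demo_dir/recovery.conf"
rm -rf "$demo_dir"
```

On a real standby the check would run against /var/lib/pgsql/9.3/data after pg_basebackup finishes and before "service postgresql-9.3 start".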




