Initiating log shipping backups when the initial filesystem-level backup went out of sync

Wouter Verhelst <wouter.verhelst.ext@xxxxxxxxxx> · Mon, 29 Aug 2016 12:02:14 +0000

Hi,

A while back, I had to replace the hardware for a postgres server containing about 600G of data. While at it, we did an upgrade of the postgresql version used, and we had intended to also modify the way backups were being done, by moving from a daily pg_dumpall in cron to shipping WAL logs to an offsite location.

Due to the many changes involved, I decided that it would be easier to copy the data by running "ssh oldserver pg_dumpall | psql postgres", but I did (approximately) the following:

- Create the cluster
- Set "wal_level = hot_standby", "archive_mode = on", and "archive_command = 'test ! -f /srv/pg_wal/%f && cp %p /srv/pg_wal/%f' in postgresql.conf.
- Set up everything so that files written to /srv/pg_wal end up (eventually) on the remote server.
- rsync the data directory to the off-site server
- Start the server
- Run the pg_dumpall thing
- Start the server on the off-site location, to test if it could open everything, and stop it again
- Create a file "recovery.conf" in the wrong location (at first) and then in the correct location (later on, after reading the documentation more carefully), with a line saying "restore_command = 'cp /srv/pg_wal/%f "%p"'" and one saying "standby_mode = on"
- Start the server

At this point, the log file shows the following message:

2016-08-29 13:53:22 CEST [25504-1] LOG:  database system was shut down in recovery at (...)
2016-08-29 13:53:22 CEST [25504-2] LOG:  entering standby mode
2016-08-29 13:53:22 CEST [25504-3] LOG:  record with zero length at 0/2000200
2016-08-29 13:53:22 CEST [25504-4] LOG:  invalid primary checkpoint record
2016-08-29 13:53:22 CEST [25504-5] LOG:  record with zero length at 0/2000198
2016-08-29 13:53:22 CEST [25504-6] LOG:  invalid secondary checkpoint record
2016-08-29 13:53:22 CEST [25504-7] PANIC:  could not locate a valid checkpoint record
2016-08-29 13:53:22 CEST [25480-1] LOG:  startup process (PID 25504) was terminated by signal 6: Aborted
2016-08-29 13:53:22 CEST [25480-2] LOG:  aborting startup due to startup process failure

In the pg_xlog directory on the off-site server, there is a file 000000010000000000000002 and one 000000010000000000000003. When I run md5sum, I find that the first has the same checksum as the one with the same name in the pg_wal directory containing the files from the original server; the second does not, but it *does* have the same checksum as the file 000000010000000000000001 in that directory (which does not exist in the pg_xlog directory on the off-site server).

I have retained all WAL files from 000000010000000000000001 all the way to the current one.

Is it still possible for me to load these WAL files into the server at the remote site, without having to do a filesystem-level copy of all the files in the postgresql database directory? If not, any hints on what I did wrong?

Thanks,

-- 
Wouter Verhelst

-- 
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin