Hi All,
INFO:
postgresql 9.0.4
I have a backup setup that is having an issue when I restore a backup
for a development environment. I received some great help on the
postgresql IRC channel on freenode, but I have been unable to track down
the exact cause thus far (some of my logging and backup files are now
gone), so I thought I'd throw it out here to see if anyone has some
additional input.
The backup system is as follows:
1. Master cluster exports WAL files and ships them to a backup server
2. The backup server then ships those WAL files to a warm standby server
after they are 1 hour old (this is so that if a horrific data deletion
happens, we can bring the warm standby online within an hour and be
online without the lengthy recovery time).
3. Once per night, we bring down the warm standby with a 'pg_ctl stop -m
fast'. We then verify that the warm standby database is OFFLINE and
perform a backup of the data directory, shipping that off to the backup
server.
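To make step 2 concrete, here is a minimal Python sketch of the "hold WAL
files back for an hour" selection. It is illustrative only, not the actual
shipping script: the function name, the archive_dir argument and the use of
file mtime as the age of a segment are all assumptions.

```python
import os
import time


def wal_files_ready_to_ship(archive_dir, min_age_seconds=3600, now=None):
    """Return paths of files in archive_dir that are at least min_age_seconds old.

    Hypothetical helper: age is taken from the file's mtime, and results are
    sorted by file name, which keeps WAL segments of one timeline in order.
    """
    now = time.time() if now is None else now
    ready = []
    for name in sorted(os.listdir(archive_dir)):
        path = os.path.join(archive_dir, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories and anything that is not a plain file
        if now - os.path.getmtime(path) >= min_age_seconds:
            ready.append(path)
    return ready
```

Anything this returns is old enough to hand to the warm standby; newer
segments stay on the backup server until a later run.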
With this backup process, we've recovered several databases without
issue, but today I've got one that is causing some issues.
Restore process:
1. Uncompress the tarball of the data directory
2. Stage WAL files in a location that pg_standby is looking for
3. Bring the cluster online, verify it's ingesting WAL files in standby
mode (via pg_standby)
4. Create trigger file to signal the standby to come out of recovery mode.
After uncompressing the data directory from the warm standby backup and
staging all the WAL files that were also archived, the database reached
a consistent recovery state and came online accepting connections. I can
connect and issue queries. I then started to set up a hot standby for
this cluster, and when I executed pg_start_backup, I received the
following error in the postgres logs:
Oct 2 20:05:44 localhost postgres[14030]: [1-1] user=,db= ERROR: xlog
flush request 79D6/2DB52998 is not satisfied --- flushed only to
79D5/DC000020
Oct 2 20:05:44 localhost postgres[14030]: [1-2] user=,db= CONTEXT:
writing block 9018 of relation base/2651908/1059795387
Oct 2 20:05:44 localhost postgres[22850]: [2-1]
user=postgres,db=postgres ERROR: checkpoint request failed
Oct 2 20:05:44 localhost postgres[22850]: [2-2]
user=postgres,db=postgres HINT: Consult recent messages in the server
log for details
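The two WAL locations in the first message can be compared numerically.
The short Python sketch below does that, treating each location as a
64-bit number (high word, then low word). That layout is an assumption for
illustration: as far as I know, on 9.0 the low word does not use the full
32-bit range, so the computed gap is only an approximate byte count, but
the ordering of the two values is still meaningful.

```python
def parse_lsn(text):
    """Convert a WAL location like '79D6/2DB52998' into one integer."""
    hi, lo = text.strip().split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


requested = parse_lsn("79D6/2DB52998")  # from "xlog flush request ..."
flushed = parse_lsn("79D5/DC000020")    # from "flushed only to ..."

gap = requested - flushed
print("requested is ahead of flushed:", requested > flushed)
print("gap in WAL address space: 0x%X" % gap)
```

One reading, offered as a hypothesis rather than a diagnosis: block 9018
carries a page LSN that is well ahead of the end of the WAL this cluster
actually has, which would mean the data files are newer than the WAL that
was replayed.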
I tracked down the table itself that it's pointing to, and it's a
relatively small table. I can query all data from it without error. Just
guessing at options, I executed a reindex on the table, and got the
following output over 100 times a second:
WARNING: concurrent delete in progress within table "mytablename"
(I also got the following a bit later on)
ERROR: index "pg_depend_depender_index" contains unexpected zero page
at block 40087
autovacuum: found orphan temp table
"pg_temp_28"."#DB_7716_INITIAL_SIZE_CHECK" in database "mydatabase"
I am currently performing a new restore from a more recent backup (last
night), and will be collecting stats and logs as I go so that I don't
lose them.
Any thoughts on what this could be, or on what other data would be
useful to collect during my second recovery attempt?
Currently I'll be collecting:
* postgres logs of the warm standby as it comes down, and back online
after backup is taken
* my logs and timestamps of when the backup took place
* postgres logs from the recovery database as it comes online
* pg_controldata output for A. master database B. warm standby database
C. recovery database before and after I actually bring it online.
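To compare the pg_controldata captures side by side, a small Python sketch
like the following could be used. It assumes only that pg_controldata
prints one "label: value" pair per line; the function names are made up
for illustration.

```python
def parse_controldata(text):
    """Parse pg_controldata output into a dict of label -> value."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def diff_controldata(a, b):
    """Return {label: (value_in_a, value_in_b)} for labels whose values differ."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }
```

Feeding it the saved output from the warm standby (before the backup) and
from the recovery database (before and after it comes online) should make
differences in fields such as the cluster state, the latest checkpoint
location and the minimum recovery ending location easy to spot.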
Thanks in advance,
- Brian F