Recovery from PITR corrupted

Hi All,

INFO:
PostgreSQL 9.0.4

I have a backup setup that is hitting an issue when I restore a backup into a development environment. I received some great help on the PostgreSQL IRC channel on freenode, but I have been unable to track down the exact cause so far (some of my logging and backup files are now gone), so I thought I'd throw it out here to see if anyone has additional input.

The backup system is as follows:

1. The master cluster archives WAL files and ships them to a backup server.

2. The backup server then ships those WAL files to a warm standby server once they are 1 hour old (this way, if a horrific data deletion happens, we can bring the warm standby online within an hour without the lengthy recovery time).

3. Once per night, we bring down the warm standby with a 'pg_ctl stop -m fast'. We then verify that the warm standby database is OFFLINE and perform a backup of the data directory, shipping that off to the backup server. (A rough sketch of the whole pipeline follows this list.)
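
For context, the moving parts look roughly like this; the paths, hostnames, and exact scripting here are illustrative placeholders, not our actual scripts:

    # On the master, postgresql.conf archives each completed WAL segment to
    # the backup server, along the lines of:
    #   archive_mode = on
    #   archive_command = 'rsync -a %p backupserver:/wal_archive/%f'

    # On the backup server (cron): forward segments older than one hour to the
    # warm standby's staging directory (bookkeeping to avoid re-sends omitted)
    find /wal_archive -type f -mmin +60 -exec rsync -a {} warmstandby:/wal_staging/ \;

    # On the warm standby (nightly): stop the cluster, confirm it is down,
    # then tar the data directory and ship it back to the backup server
    pg_ctl -D /var/lib/pgsql/data stop -m fast
    pg_ctl -D /var/lib/pgsql/data status   # should report that no server is running
    BACKUP=/backups/standby_data_$(date +%Y%m%d).tar.gz
    tar -czf "$BACKUP" -C /var/lib/pgsql data
    rsync -a "$BACKUP" backupserver:/base_backups/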


With this backup process, we've recovered several databases without issue, but today I've hit one that is giving me trouble.

Restore process:
1. Uncompress the tarball of the data directory

2. Stage the archived WAL files in the location pg_standby is watching

3. Bring the cluster online and verify it is ingesting WAL files in standby mode (via pg_standby)

4. Create the trigger file to signal the standby to come out of recovery mode (see the sketch after this list).
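
Concretely, steps 2 through 4 amount to something like the following; the archive path and trigger-file location are placeholders, not our exact configuration:

    # recovery.conf in the restored data directory, pointing pg_standby at the
    # directory where the archived WAL files were staged
    cat > /var/lib/pgsql/data/recovery.conf <<'EOF'
    restore_command = 'pg_standby -t /tmp/pgsql.trigger /wal_staging %f %p %r'
    EOF

    # step 3: start the cluster; it stays in recovery, pulling WAL via pg_standby
    pg_ctl -D /var/lib/pgsql/data start

    # step 4: create the trigger file so pg_standby stops waiting for new WAL
    # and the server finishes recovery and starts accepting connections
    touch /tmp/pgsql.trigger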

After uncompressing the data directory from the warm standby backup and staging all the WAL files that were also archived, the database reached a consistent recovery state and came online accepting connections. I can connect and issue queries. I then started to set up a hot standby for this cluster, and when I executed a pg_start_backup, I received the following error in the postgres logs:

Oct 2 20:05:44 localhost postgres[14030]: [1-1] user=,db= ERROR: xlog flush request 79D6/2DB52998 is not satisfied --- flushed only to 79D5/DC000020
Oct 2 20:05:44 localhost postgres[14030]: [1-2] user=,db= CONTEXT: writing block 9018 of relation base/2651908/1059795387
Oct 2 20:05:44 localhost postgres[22850]: [2-1] user=postgres,db=postgres ERROR: checkpoint request failed
Oct 2 20:05:44 localhost postgres[22850]: [2-2] user=postgres,db=postgres HINT: Consult recent messages in the server log for details
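
(For reference, the call that produced this was nothing exotic, just the usual exclusive backup start run as a superuser; the label below is a placeholder. It is the checkpoint this forces that fails:)

    psql -U postgres -c "SELECT pg_start_backup('standby_base');"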

I tracked down the table it is pointing to, and it is a relatively small table; I can query all of its data without error. Just guessing at options, I executed a reindex on the table and got the following output over 100 times a second:
WARNING:  concurrent delete in progress within table "mytablename"

(I also got the following a bit later on)
ERROR: index "pg_depend_depender_index" contains unexpected zero page at block 40087
autovacuum: found orphan temp table "pg_temp_28"."#DB_7716_INITIAL_SIZE_CHECK" in database "mydatabase"
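
(For the record, the reindex in question was just the plain per-table form, with names anonymized here:)

    psql -d mydatabase -c "REINDEX TABLE mytablename;"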

I am currently performing a new restore from a more recent backup (last night's), and will be collecting stats and logs as I go so that I don't lose them.

Any thoughts on what this could be, or suggestions for other useful data to collect during my second recovery attempt?

Currently I'll be collecting:
* postgres logs of the warm standby as it shuts down and comes back online after the backup is taken
* my own logs and timestamps of when the backup took place
* postgres logs from the recovery database as it comes online
* pg_controldata output for (a) the master database, (b) the warm standby database, and (c) the recovery database, both before and after I actually bring it online (captured roughly as sketched below)
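
The pg_controldata snapshots are just the stock tool pointed at each data directory and captured to a file, along these lines (the path and file name are placeholders):

    pg_controldata /var/lib/pgsql/data > controldata_recovery_$(date +%Y%m%d%H%M).txt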

Thanks in advance,
- Brian F


