Database Recovery

"Mike C" <smith.not.western@xxxxxxxxx> · Mon, 8 Oct 2007 10:47:49 +1300

Hi,

I have a corrupt database. I copied the database directory offsite and
restored from a backup. I'm now trying to recover the data that was
added to the corrupt database since the last backup. Any advice on how
to approach this?

It looks like my SAN RAID controllers failed (both of them) after new
disks were put in... very suspicious. Anyway, after the SAN failed we
had to reboot it, and after the reboot I successfully brought up the
database. To test consistency, I ran VACUUM FULL on a selection of
tables (the DB is ~200GB). It looked OK, so I started using it again;
a few minutes later the database was corrupt and completely unusable.
At that point I did a restore from a backup.

Errors from the SAN (just in case they add something):

Oct  1 09:26:44 dbserver kernel: qla2300 0000:05:01.0: LOOP DOWN detected (2).
Oct  1 09:27:47 dbserver kernel: qla2x00: no more failovers for
request - pid= 3300789468
Oct  1 09:27:47 dbserver kernel: SCSI error : <1 0 0 2> return code = 0x10000
Oct  1 09:27:47 dbserver kernel: end_request: I/O error, dev sdd,
sector 179425575
Oct  1 09:27:47 dbserver kernel: Buffer I/O error on device sdd1,
logical block 22428189
Oct  1 09:27:47 dbserver kernel: lost page write due to I/O error on sdd1
Oct  1 09:27:47 dbserver kernel: qla2x00: no more failovers for
request - pid= 3300789494
Oct  1 09:27:47 dbserver kernel: SCSI error : <1 0 0 2> return code = 0x10000
Oct  1 09:27:47 dbserver kernel: qla2x00: no more failovers for
request - pid= 3300789498
...
Oct  1 09:27:48 dbserver postgres_report[22657]: [2-1] LOG:  could not
close temporary statistics file "global/pgstat.tmp": Input/output
error
...
Oct  1 09:28:04 dbserver postgres_report[22657]: [4-1] LOG:  could not
open temporary statistics file "global/pgstat.tmp": Read-only file
system
...
Oct  1 09:28:04 dbserver kernel: ext3_abort called.
Oct  1 09:28:04 dbserver kernel: EXT3-fs error (device sdd1):
ext3_journal_start_sb: Detected aborted journal
Oct  1 09:28:04 dbserver kernel: Remounting filesystem read-only
...
Oct  1 09:28:40 dbserver postgres_report[22953]: [31-1] ERROR:  could
not read block 10636 of relation 1663/24576/24907: Input/output error
...
Oct  1 09:28:40 dbserver postgres_report[30277]: [1-1] ERROR:  could
not read block 207084 of relation 1663/24576/24977: Success
...
Oct  1 09:28:43 dbserver postgres_report[22655]: [2-1] ERROR:  could
not access status of transaction 0
Oct  1 09:28:43 dbserver postgres_report[22655]: [2-2] DETAIL:  could
not open file "pg_clog/013F": Read-only file system
...
Oct  1 09:29:15 dbserver postgres_report[22655]: [3-1] ERROR:  could
not access status of transaction 0
...
Oct  1 09:29:15 dbserver postgres_report[13902]: [2-1] LOG:
unexpected EOF on client connection
Oct  1 09:29:15 dbserver postgres_report[22657]: [6-1] LOG:  could not
open temporary statistics file "global/pgstat.tmp": Read-only file
system
Oct  1 09:29:15 dbserver postgres_report[22655]: [3-2] DETAIL:  could
not open file "pg_clog/013F": Read-only file system
...
Oct  1 09:29:28 dbserver kernel: EXT3-fs error (device sdd1) in
ext3_prepare_write: IO failure
Oct  1 09:29:28 dbserver kernel: EXT3-fs error (device sdd1) in
start_transaction: Journal has aborted
etc...
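
As a reading aid for the "could not read block" errors above, here is a
minimal sketch of how a relation identifier and block number map to a
file and byte offset under the data directory. It assumes things the
log does not state: the default 8 kB block size, 1 GB segment files,
and that 1663 is the OID of the default tablespace (so files live under
base/). The function name locate_block is made up for illustration.

```python
# Assumptions (not confirmed by the post): 8 kB blocks, 1 GB segment
# files, and tablespace OID 1663 being pg_default (files under base/).
BLOCK_SIZE = 8192
SEGMENT_SIZE = 1024 ** 3
BLOCKS_PER_SEGMENT = SEGMENT_SIZE // BLOCK_SIZE  # 131072 blocks per file
PG_DEFAULT_TABLESPACE = 1663


def locate_block(relation, block):
    """Map 'tablespace/database/relfilenode' plus a block number to
    (path relative to the data directory, byte offset in that file)."""
    tablespace, database, relfilenode = (int(p) for p in relation.split("/"))
    if tablespace != PG_DEFAULT_TABLESPACE:
        raise ValueError("this sketch only handles the default tablespace")
    segment, block_in_segment = divmod(block, BLOCKS_PER_SEGMENT)
    path = f"base/{database}/{relfilenode}"
    if segment > 0:
        # Segments after the first carry a numeric suffix: .1, .2, ...
        path += f".{segment}"
    return path, block_in_segment * BLOCK_SIZE


if __name__ == "__main__":
    # The two failing reads reported in the log above.
    for rel, blk in [("1663/24576/24907", 10636), ("1663/24576/24977", 207084)]:
        print(rel, "block", blk, "->", locate_block(rel, blk))
```

Under those assumptions, the second error falls in the second segment
file of relfilenode 24977, which is the same number the later PANIC
names.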

After (supposedly) fixing the SAN, and before restoring from backup, I
tried to get the original database to start:

Oct  1 11:28:47 dbserver postgres_report[20898]: [1-1] LOG:  database
system was interrupted at 2007-10-04 09:17:24 NZST
Oct  1 11:28:48 dbserver postgres_report[20898]: [2-1] LOG:
checkpoint record is at 51F/6CEF22C8
Oct  1 11:28:48 dbserver postgres_report[20898]: [3-1] LOG:  redo
record is at 51F/6CEE4410; undo record is at 0/0; shutdown FALSE
Oct  1 11:28:48 dbserver postgres_report[20898]: [4-1] LOG:  next
transaction ID: 334744042; next OID: 32829
Oct  1 11:28:48 dbserver postgres_report[20898]: [5-1] LOG:  next
MultiXactId: 1; next MultiXactOffset: 0
Oct  1 11:28:48 dbserver postgres_report[20898]: [6-1] LOG:  database
system was not properly shut down; automatic recovery in progress
Oct  1 11:28:48 dbserver postgres_report[20898]: [7-1] LOG:  redo
starts at 51F/6CEE4410
Oct  1 11:30:01 dbserver postgres_report[20997]: [1-1] FATAL:  the
database system is starting up
Oct  1 11:30:11 dbserver postgres_report[20898]: [8-1] LOG:  record
with zero length at 51F/73A75718
Oct  1 11:30:11 dbserver postgres_report[20898]: [9-1] LOG:  redo done
at 51F/73A756E8
Oct  1 11:30:11 dbserver postgres_report[20898]: [10-1] LOG:  database
system is ready
Oct  1 11:30:11 dbserver postgres_report[20898]: [11-1] LOG:
transaction ID wrap limit is 1407525850, limited by database "reports"
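
For what it's worth, the pg_clog/013F file from the earlier errors and
the "next transaction ID" reported here can be cross-checked with a
little arithmetic. This is a sketch only, assuming the usual layout of
8 kB pages, 2 status bits per transaction and 32 pages per segment
file; the helper names are made up for illustration.

```python
# Assumptions (not confirmed by the post): 8 kB pages, 2 status bits
# per transaction (4 per byte), 32 pages per pg_clog segment file.
BLOCK_SIZE = 8192
XACTS_PER_BYTE = 4
PAGES_PER_SEGMENT = 32
XACTS_PER_SEGMENT = BLOCK_SIZE * XACTS_PER_BYTE * PAGES_PER_SEGMENT  # 1048576


def clog_segment_for_xid(xid):
    """Name of the pg_clog file holding the commit status of xid."""
    return f"{xid // XACTS_PER_SEGMENT:04X}"


def xid_range_for_segment(name):
    """First and last transaction ID covered by a pg_clog file."""
    first = int(name, 16) * XACTS_PER_SEGMENT
    return first, first + XACTS_PER_SEGMENT - 1


if __name__ == "__main__":
    print(clog_segment_for_xid(334744042))  # next transaction ID from the log
    print(xid_range_for_segment("013F"))
```

Under those assumptions the next transaction ID lands in 013F, i.e. the
clog file that could not be opened covers the most recent transactions.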

At that point I ran VACUUM FULL on selected tables. No errors were
reported in psql, so I thought it was OK. Then this happened:

Oct  1 12:10:02 dbserver postgres_report[22636]: [1-1] ERROR:  failed
to re-find parent key in "table_1_idx3" for split pages 230790/263989
Oct  1 12:10:10 dbserver postgres_report[22636]: [2-1] PANIC:  right
sibling's left-link doesn't match
Oct  1 12:10:10 dbserver postgres_report[22636]: [2-2] STATEMENT:
insert into TABLE_1 [blah blah blah]
Oct  1 12:10:10 dbserver postgres_report[20897]: [1-1] LOG:  server
process (PID 22636) was terminated by signal 6
Oct  1 12:10:10 dbserver postgres_report[20897]: [2-1] LOG:
terminating any other active server processes
Oct  1 12:10:10 dbserver postgres_report[22659]: [1-1] WARNING:
terminating connection because of crash of another server process
Oct  1 12:10:10 dbserver postgres_report[22662]: [1-1] WARNING:
terminating connection because of crash of another server process
Oct  1 12:10:10 dbserver postgres_report[22659]: [1-2] DETAIL:  The
postmaster has commanded this server process to roll back the current
transaction and exit, because another server
Oct  1 12:10:10 dbserver postgres_report[22659]: [1-3]  process exited
abnormally and possibly corrupted shared memory.
Oct  1 12:10:10 dbserver postgres_report[22662]: [1-2] DETAIL:  The
postmaster has commanded this server process to roll back the current
transaction and exit, because another server
Oct  1 12:10:10 dbserver postgres_report[22659]: [1-4] HINT:  In a
moment you should be able to reconnect to the database and repeat your
command.
Oct  1 12:10:10 dbserver postgres_report[22662]: [1-3]  process exited
abnormally and possibly corrupted shared memory.
Oct  1 12:10:10 dbserver postgres_report[22661]: [1-1] WARNING:
terminating connection because of crash of another server process
Oct  1 12:10:10 dbserver postgres_report[22662]: [1-4] HINT:  In a
moment you should be able to reconnect to the database and repeat your
command.
Oct  1 12:10:10 dbserver postgres_report[22660]: [1-1] WARNING:
terminating connection because of crash of another server process
Oct  1 12:10:10 dbserver postgres_report[22658]: [1-1] WARNING:
terminating connection because of crash of another server process
Oct  1 12:10:10 dbserver postgres_report[22653]: [1-1] WARNING:
terminating connection because of crash of another server process
Oct  1 12:10:10 dbserver postgres_report[22591]: [1-1] WARNING:
terminating connection because of crash of another server process
...
Oct  1 12:10:10 dbserver postgres_report[22661]: [1-3]  process exited
abnormally and possibly corrupted shared memory.
Oct  1 12:10:10 dbserver postgres_report[22660]: [1-3]  process exited
abnormally and possibly corrupted shared memory.
Oct  1 12:10:11 dbserver postgres_report[22658]: [1-3]  process exited
abnormally and possibly corrupted shared memory.
Oct  1 12:10:11 dbserver postgres_report[22653]: [1-3]  process exited
abnormally and possibly corrupted shared memory.
Oct  1 12:10:11 dbserver postgres_report[22675]: [3-1] FATAL:  the
database system is in recovery mode
Oct  1 12:10:11 dbserver postgres_report[22591]: [1-3]  process exited
abnormally and possibly corrupted shared memory.
Oct  1 12:10:11 dbserver postgres_report[22647]: [1-3]  process exited
abnormally and possibly corrupted shared memory.
Oct  1 12:10:11 dbserver postgres_report[22646]: [1-3]  process exited
abnormally and possibly corrupted shared memory.
...
Oct  1 12:10:12 dbserver postgres_report[22644]: [1-4] HINT:  In a
moment you should be able to reconnect to the database and repeat your
command.
Oct  1 12:10:12 dbserver postgres_report[22677]: [3-1] FATAL:  the
database system is in recovery mode
Oct  1 12:10:12 dbserver postgres_report[22637]: [1-4] HINT:  In a
moment you should be able to reconnect to the database and repeat your
command.
...
Oct  1 12:10:14 dbserver postgres_report[20897]: [3-1] LOG:  all
server processes terminated; reinitializing
Oct  1 12:10:14 dbserver postgres_report[22696]: [4-1] FATAL:  the
database system is starting up
Oct  1 12:10:14 dbserver postgres_report[22697]: [4-1] FATAL:  the
database system is starting up
Oct  1 12:10:14 dbserver postgres_report[22695]: [4-1] LOG:  database
system was interrupted at 2007-10-04 12:09:25 NZST
Oct  1 12:10:14 dbserver postgres_report[22698]: [4-1] FATAL:  the
database system is starting up
Oct  1 12:10:14 dbserver postgres_report[22699]: [4-1] FATAL:  the
database system is starting up
Oct  1 12:10:14 dbserver postgres_report[22695]: [5-1] LOG:
checkpoint record is at 520/66DDD00
Oct  1 12:10:14 dbserver postgres_report[22700]: [4-1] FATAL:  the
database system is starting up
Oct  1 12:10:14 dbserver postgres_report[22701]: [4-1] FATAL:  the
database system is starting up
Oct  1 12:10:14 dbserver postgres_report[22695]: [6-1] LOG:  redo
record is at 520/66DDD00; undo record is at 0/0; shutdown FALSE
Oct  1 12:10:14 dbserver postgres_report[22695]: [7-1] LOG:  next
transaction ID: 334746303; next OID: 32829
Oct  1 12:10:14 dbserver postgres_report[22702]: [4-1] FATAL:  the
database system is starting up
Oct  1 12:10:14 dbserver postgres_report[22695]: [8-1] LOG:  next
MultiXactId: 1; next MultiXactOffset: 0
Oct  1 12:10:14 dbserver postgres_report[22703]: [4-1] FATAL:  the
database system is starting up
Oct  1 12:10:14 dbserver postgres_report[22695]: [9-1] LOG:  database
system was not properly shut down; automatic recovery in progress
Oct  1 12:10:14 dbserver postgres_report[22704]: [4-1] FATAL:  the
database system is starting up
Oct  1 12:10:15 dbserver postgres_report[22705]: [4-1] FATAL:  the
database system is starting up
Oct  1 12:10:15 dbserver postgres_report[22695]: [10-1] LOG:  redo
starts at 520/66DDD50
Oct  1 12:10:15 dbserver postgres_report[22706]: [4-1] FATAL:  the
database system is starting up
Oct  1 12:10:15 dbserver postgres_report[22707]: [4-1] FATAL:  the
database system is starting up
Oct  1 12:10:15 dbserver postgres_report[22708]: [4-1] FATAL:  the
database system is starting up
Oct  1 12:10:15 dbserver postgres_report[22709]: [4-1] FATAL:  the
database system is starting up
Oct  1 12:10:15 dbserver postgres_report[22710]: [4-1] FATAL:  the
database system is starting up
Oct  1 12:10:15 dbserver postgres_report[22695]: [11-1] LOG:  record
with zero length at 520/6E2D130
Oct  1 12:10:15 dbserver postgres_report[22695]: [12-1] LOG:  redo
done at 520/6E2D100
...
Oct  1 12:10:34 dbserver postgres_report[22888]: [4-1] FATAL:  the
database system is starting up
Oct  1 12:10:34 dbserver postgres_report[22695]: [13-1] PANIC:  failed
to re-find parent key in "24977" for split pages 230790/263989
Oct  1 12:10:34 dbserver postgres_report[20897]: [4-1] LOG:  startup
process (PID 22695) was terminated by signal 6
Oct  1 12:10:34 dbserver postgres_report[20897]: [5-1] LOG:  aborting
startup due to startup process failure

...

I tried to restart again:

Oct  1 12:28:44 dbserver postgres_report[23728]: [1-1] LOG:  database
system was interrupted while in recovery at 2007-10-04 12:10:15 NZST
Oct  1 12:28:44 dbserver postgres_report[23728]: [1-2] HINT:  This
probably means that some data is corrupted and you will have to use
the last backup for recovery.
Oct  1 12:28:44 dbserver postgres_report[23728]: [2-1] LOG:
checkpoint record is at 520/66DDD00
Oct  1 12:28:44 dbserver postgres_report[23728]: [3-1] LOG:  redo
record is at 520/66DDD00; undo record is at 0/0; shutdown FALSE
Oct  1 12:28:44 dbserver postgres_report[23728]: [4-1] LOG:  next
transaction ID: 334746303; next OID: 32829
Oct  1 12:28:44 dbserver postgres_report[23728]: [5-1] LOG:  next
MultiXactId: 1; next MultiXactOffset: 0
Oct  1 12:28:44 dbserver postgres_report[23728]: [6-1] LOG:  database
system was not properly shut down; automatic recovery in progress
Oct  1 12:28:44 dbserver postgres_report[23728]: [7-1] LOG:  redo
starts at 520/66DDD50
Oct  1 12:28:51 dbserver postgres_report[23728]: [8-1] LOG:  record
with zero length at 520/6E2D130
Oct  1 12:28:51 dbserver postgres_report[23728]: [9-1] LOG:  redo done
at 520/6E2D100
Oct  1 12:29:06 dbserver postgres_report[23728]: [10-1] PANIC:  failed
to re-find parent key in "24977" for split pages 230790/263989
Oct  1 12:29:06 dbserver postgres_report[23727]: [1-1] LOG:  startup
process (PID 23728) was terminated by signal 6
Oct  1 12:29:06 dbserver postgres_report[23727]: [2-1] LOG:  aborting
startup due to startup process failure

I saw the HINT about restoring and did so, but first I ran fsck several
times over until the various problems (corrupt/lost inodes etc.) were
fixed.

So, back to my original problem. Based on what I've shown here, do you
think there is anything I can do to recover some of the data in the
corrupt copy? (I literally just did a mv data data.old, then initdb'd
the new database and restored from backup.) I guess the WAL files are
useless because the vacuum I ran (or PostgreSQL's own recovery
process) used them all up?
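
To check which WAL segment files the redo locations in the logs refer
to (and whether they still exist in data.old/pg_xlog), something like
the following could be used. It is a sketch under assumptions the post
does not confirm: 16 MB WAL segments, timeline 1, and the 24-hex-digit
file naming of timeline, log id and segment number. The function name
is made up for illustration.

```python
# Assumptions (not confirmed by the post): 16 MB WAL segments and
# timeline 1; file name = timeline, log id, segment number in hex.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def wal_segment_name(location, timeline=1):
    """Turn a location such as '51F/6CEE4410' into a segment file name."""
    log_id, offset = (int(part, 16) for part in location.split("/"))
    segment = offset // WAL_SEGMENT_SIZE
    return f"{timeline:08X}{log_id:08X}{segment:08X}"


if __name__ == "__main__":
    # Start and end of redo for the first and second recoveries in the logs.
    for loc in ["51F/6CEE4410", "51F/73A75718", "520/66DDD00", "520/6E2D130"]:
        print(loc, "->", wal_segment_name(loc))
```

If the assumptions hold, the second recovery starts and ends inside a
single segment file, so that file is the one to look for first.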

Regards,

Mike
