Hi, I have a corrupt database. I copied the database directory offsite and restored from a backup. I'm now trying to recover data in the corrupt database added since the last backup. Any advice on how to approach this? It looks like my SAN raid controllers failed (both of them) after putting in new disks... very suspicious. Anyway. After the SAN failed, we had to reboot it, and upon rebooting I successfuly brought up the database. To test consistency, I decided to vacuum full some selective tables (DB ~200GB). It looked ok, so started using it again, then a few minutes later the database was corrupt and completely unusable. At that point I did a restore from a backup. Errors from SAN (Just in case it adds something): Oct 1 09:26:44 dbserver kernel: qla2300 0000:05:01.0: LOOP DOWN detected (2). Oct 1 09:27:47 dbserver kernel: qla2x00: no more failovers for request - pid= 3300789468 Oct 1 09:27:47 dbserver kernel: SCSI error : <1 0 0 2> return code = 0x10000 Oct 1 09:27:47 dbserver kernel: end_request: I/O error, dev sdd, sector 179425575 Oct 1 09:27:47 dbserver kernel: Buffer I/O error on device sdd1, logical block 22428189 Oct 1 09:27:47 dbserver kernel: lost page write due to I/O error on sdd1 Oct 1 09:27:47 dbserver kernel: qla2x00: no more failovers for request - pid= 3300789494 Oct 1 09:27:47 dbserver kernel: SCSI error : <1 0 0 2> return code = 0x10000 Oct 1 09:27:47 dbserver kernel: qla2x00: no more failovers for request - pid= 3300789498 ... Oct 1 09:27:48 dbserver postgres_report[22657]: [2-1] LOG: could not close temporary statistics file "global/pgstat.tmp": Input/output error ... Oct 1 09:28:04 dbserver postgres_report[22657]: [4-1] LOG: could not open temporary statistics file "global/pgstat.tmp": Read-only file system ... Oct 1 09:28:04 dbserver kernel: ext3_abort called. Oct 1 09:28:04 dbserver kernel: EXT3-fs error (device sdd1): ext3_journal_start_sb: Detected aborted journal Oct 1 09:28:04 dbserver kernel: Remounting filesystem read-only ... Oct 1 09:28:40 dbserver postgres_report[22953]: [31-1] ERROR: could not read block 10636 of relation 1663/24576/24907: Input/output error ... Oct 1 09:28:40 dbserver postgres_report[30277]: [1-1] ERROR: could not read block 207084 of relation 1663/24576/24977: Success ... Oct 1 09:28:43 dbserver postgres_report[22655]: [2-1] ERROR: could not access status of transaction 0 Oct 1 09:28:43 dbserver postgres_report[22655]: [2-2] DETAIL: could not open file "pg_clog/013F": Read-only file system ... Oct 1 09:29:15 dbserver postgres_report[22655]: [3-1] ERROR: could not access status of transaction 0 ... Oct 1 09:29:15 dbserver postgres_report[13902]: [2-1] LOG: unexpected EOF on client connection Oct 1 09:29:15 dbserver postgres_report[22657]: [6-1] LOG: could not open temporary statistics file "global/pgstat.tmp": Read-only file system Oct 1 09:29:15 dbserver postgres_report[22655]: [3-2] DETAIL: could not open file "pg_clog/013F": Read-only file system ... Oct 1 09:29:28 dbserver kernel: EXT3-fs error (device sdd1) in ext3_prepare_write: IO failure Oct 1 09:29:28 dbserver kernel: EXT3-fs error (device sdd1) in start_transaction: Journal has aborted etc... After (supposedly) fixing the SAN (Before restoring from backup), just trying to get the original db to start: Oct 1 11:28:47 dbserver postgres_report[20898]: [1-1] LOG: database system was interrupted at 2007-10-04 09:17:24 NZST Oct 1 11:28:48 dbserver postgres_report[20898]: [2-1] LOG: checkpoint record is at 51F/6CEF22C8 Oct 1 11:28:48 dbserver postgres_report[20898]: [3-1] LOG: redo record is at 51F/6CEE4410; undo record is at 0/0; shutdown FALSE Oct 1 11:28:48 dbserver postgres_report[20898]: [4-1] LOG: next transaction ID: 334744042; next OID: 32829 Oct 1 11:28:48 dbserver postgres_report[20898]: [5-1] LOG: next MultiXactId: 1; next MultiXactOffset: 0 Oct 1 11:28:48 dbserver postgres_report[20898]: [6-1] LOG: database system was not properly shut down; automatic recovery in progress Oct 1 11:28:48 dbserver postgres_report[20898]: [7-1] LOG: redo starts at 51F/6CEE4410 Oct 1 11:30:01 dbserver postgres_report[20997]: [1-1] FATAL: the database system is starting up Oct 1 11:30:11 dbserver postgres_report[20898]: [8-1] LOG: record with zero length at 51F/73A75718 Oct 1 11:30:11 dbserver postgres_report[20898]: [9-1] LOG: redo done at 51F/73A756E8 Oct 1 11:30:11 dbserver postgres_report[20898]: [10-1] LOG: database system is ready Oct 1 11:30:11 dbserver postgres_report[20898]: [11-1] LOG: transaction ID wrap limit is 1407525850, limited by database "reports" At which point I did a vacuum full on select tables. No errors reported in psql so thought ok. Then this happens: Oct 1 12:10:02 dbserver postgres_report[22636]: [1-1] ERROR: failed to re-find parent key in "table_1_idx3" for split pages 230790/263989 Oct 1 12:10:10 dbserver postgres_report[22636]: [2-1] PANIC: right sibling's left-link doesn't match Oct 1 12:10:10 dbserver postgres_report[22636]: [2-2] STATEMENT: insert into TABLE_1 [blah blah blah] Oct 1 12:10:10 dbserver postgres_report[20897]: [1-1] LOG: server process (PID 22636) was terminated by signal 6 Oct 1 12:10:10 dbserver postgres_report[20897]: [2-1] LOG: terminating any other active server processes Oct 1 12:10:10 dbserver postgres_report[22659]: [1-1] WARNING: terminating connection because of crash of another server process Oct 1 12:10:10 dbserver postgres_report[22662]: [1-1] WARNING: terminating connection because of crash of another server process Oct 1 12:10:10 dbserver postgres_report[22659]: [1-2] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server Oct 1 12:10:10 dbserver postgres_report[22659]: [1-3] process exited abnormally and possibly corrupted shared memory. Oct 1 12:10:10 dbserver postgres_report[22662]: [1-2] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server Oct 1 12:10:10 dbserver postgres_report[22659]: [1-4] HINT: In a moment you should be able to reconnect to the database and repeat your command. Oct 1 12:10:10 dbserver postgres_report[22662]: [1-3] process exited abnormally and possibly corrupted shared memory. Oct 1 12:10:10 dbserver postgres_report[22661]: [1-1] WARNING: terminating connection because of crash of another server process Oct 1 12:10:10 dbserver postgres_report[22662]: [1-4] HINT: In a moment you should be able to reconnect to the database and repeat your command. Oct 1 12:10:10 dbserver postgres_report[22660]: [1-1] WARNING: terminating connection because of crash of another server process Oct 1 12:10:10 dbserver postgres_report[22658]: [1-1] WARNING: terminating connection because of crash of another server process Oct 1 12:10:10 dbserver postgres_report[22653]: [1-1] WARNING: terminating connection because of crash of another server process Oct 1 12:10:10 dbserver postgres_report[22591]: [1-1] WARNING: terminating connection because of crash of another server process ... Oct 1 12:10:10 dbserver postgres_report[22661]: [1-3] process exited abnormally and possibly corrupted shared memory. Oct 1 12:10:10 dbserver postgres_report[22660]: [1-3] process exited abnormally and possibly corrupted shared memory. Oct 1 12:10:11 dbserver postgres_report[22658]: [1-3] process exited abnormally and possibly corrupted shared memory. Oct 1 12:10:11 dbserver postgres_report[22653]: [1-3] process exited abnormally and possibly corrupted shared memory. Oct 1 12:10:11 dbserver postgres_report[22675]: [3-1] FATAL: the database system is in recovery mode Oct 1 12:10:11 dbserver postgres_report[22591]: [1-3] process exited abnormally and possibly corrupted shared memory. Oct 1 12:10:11 dbserver postgres_report[22647]: [1-3] process exited abnormally and possibly corrupted shared memory. Oct 1 12:10:11 dbserver postgres_report[22646]: [1-3] process exited abnormally and possibly corrupted shared memory. ... Oct 1 12:10:12 dbserver postgres_report[22644]: [1-4] HINT: In a moment you should be able to reconnect to the database and repeat your command. Oct 1 12:10:12 dbserver postgres_report[22677]: [3-1] FATAL: the database system is in recovery mode Oct 1 12:10:12 dbserver postgres_report[22637]: [1-4] HINT: In a moment you should be able to reconnect to the database and repeat your command. ... Oct 1 12:10:14 dbserver postgres_report[20897]: [3-1] LOG: all server processes terminated; reinitializing Oct 1 12:10:14 dbserver postgres_report[22696]: [4-1] FATAL: the database system is starting up Oct 1 12:10:14 dbserver postgres_report[22697]: [4-1] FATAL: the database system is starting up Oct 1 12:10:14 dbserver postgres_report[22695]: [4-1] LOG: database system was interrupted at 2007-10-04 12:09:25 NZST Oct 1 12:10:14 dbserver postgres_report[22698]: [4-1] FATAL: the database system is starting up Oct 1 12:10:14 dbserver postgres_report[22699]: [4-1] FATAL: the database system is starting up Oct 1 12:10:14 dbserver postgres_report[22695]: [5-1] LOG: checkpoint record is at 520/66DDD00 Oct 1 12:10:14 dbserver postgres_report[22700]: [4-1] FATAL: the database system is starting up Oct 1 12:10:14 dbserver postgres_report[22701]: [4-1] FATAL: the database system is starting up Oct 1 12:10:14 dbserver postgres_report[22695]: [6-1] LOG: redo record is at 520/66DDD00; undo record is at 0/0; shutdown FALSE Oct 1 12:10:14 dbserver postgres_report[22695]: [7-1] LOG: next transaction ID: 334746303; next OID: 32829 Oct 1 12:10:14 dbserver postgres_report[22702]: [4-1] FATAL: the database system is starting up Oct 1 12:10:14 dbserver postgres_report[22695]: [8-1] LOG: next MultiXactId: 1; next MultiXactOffset: 0 Oct 1 12:10:14 dbserver postgres_report[22703]: [4-1] FATAL: the database system is starting up Oct 1 12:10:14 dbserver postgres_report[22695]: [9-1] LOG: database system was not properly shut down; automatic recovery in progress Oct 1 12:10:14 dbserver postgres_report[22704]: [4-1] FATAL: the database system is starting up Oct 1 12:10:15 dbserver postgres_report[22705]: [4-1] FATAL: the database system is starting up Oct 1 12:10:15 dbserver postgres_report[22695]: [10-1] LOG: redo starts at 520/66DDD50 Oct 1 12:10:15 dbserver postgres_report[22706]: [4-1] FATAL: the database system is starting up Oct 1 12:10:15 dbserver postgres_report[22707]: [4-1] FATAL: the database system is starting up Oct 1 12:10:15 dbserver postgres_report[22708]: [4-1] FATAL: the database system is starting up Oct 1 12:10:15 dbserver postgres_report[22709]: [4-1] FATAL: the database system is starting up Oct 1 12:10:15 dbserver postgres_report[22710]: [4-1] FATAL: the database system is starting up Oct 1 12:10:15 dbserver postgres_report[22695]: [11-1] LOG: record with zero length at 520/6E2D130 Oct 1 12:10:15 dbserver postgres_report[22695]: [12-1] LOG: redo done at 520/6E2D100 ... Oct 1 12:10:34 dbserver postgres_report[22888]: [4-1] FATAL: the database system is starting up Oct 1 12:10:34 dbserver postgres_report[22695]: [13-1] PANIC: failed to re-find parent key in "24977" for split pages 230790/263989 Oct 1 12:10:34 dbserver postgres_report[20897]: [4-1] LOG: startup process (PID 22695) was terminated by signal 6 Oct 1 12:10:34 dbserver postgres_report[20897]: [5-1] LOG: aborting startup due to startup process failure ... Tried to restart again: Oct 1 12:28:44 dbserver postgres_report[23728]: [1-1] LOG: database system was interrupted while in recovery at 2007-10-04 12:10:15 NZST Oct 1 12:28:44 dbserver postgres_report[23728]: [1-2] HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery. Oct 1 12:28:44 dbserver postgres_report[23728]: [2-1] LOG: checkpoint record is at 520/66DDD00 Oct 1 12:28:44 dbserver postgres_report[23728]: [3-1] LOG: redo record is at 520/66DDD00; undo record is at 0/0; shutdown FALSE Oct 1 12:28:44 dbserver postgres_report[23728]: [4-1] LOG: next transaction ID: 334746303; next OID: 32829 Oct 1 12:28:44 dbserver postgres_report[23728]: [5-1] LOG: next MultiXactId: 1; next MultiXactOffset: 0 Oct 1 12:28:44 dbserver postgres_report[23728]: [6-1] LOG: database system was not properly shut down; automatic recovery in progress Oct 1 12:28:44 dbserver postgres_report[23728]: [7-1] LOG: redo starts at 520/66DDD50 Oct 1 12:28:51 dbserver postgres_report[23728]: [8-1] LOG: record with zero length at 520/6E2D130 Oct 1 12:28:51 dbserver postgres_report[23728]: [9-1] LOG: redo done at 520/6E2D100 Oct 1 12:29:06 dbserver postgres_report[23728]: [10-1] PANIC: failed to re-find parent key in "24977" for split pages 230790/263989 Oct 1 12:29:06 dbserver postgres_report[23727]: [1-1] LOG: startup process (PID 23728) was terminated by signal 6 Oct 1 12:29:06 dbserver postgres_report[23727]: [2-1] LOG: aborting startup due to startup process failure I see the HINT about restoring and do, but first fsck several times over while various problems are fixed (corrupt/lost inodes etc). So, back to my original problem. Based on what I've shown here, do you think there is anything I can do to recover some of the data in the corrupt copy? (I literally just did a mv data data.old then initdb'd the new database and restored from backup). I guess the WAL files are useless because of the vacuum I did used them all (or the recover process from postgres itself)? Regards, Mike ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster