corruption in system tables (9.1.13)

Jan-Pieter Cornet <johnpc@xxxxxxxxxx> · Fri, 19 Sep 2014 17:29:49 +0200

Hi,

One of our postgres database clusters suddenly developed a corruption in the system tables. I tried to debug this as best I could, but haven't found a root cause yet. I'm mainly seeking more pointers to attack this.

The error message that suddenly appeared for a lot of queries (but not all queries), is:

ERROR:  missing chunk number 0 for toast value 4132453 in pg_toast_2619

The database was running 9.1.13 at the time (upgraded to 9.1.14 now, but the releasenotes for 9.1.14 do not mention this sort of corruption. We do not use GiST indices). We cannot upgrade to 9.3.x at the moment because of dependencies on other components (it's high on the list of priorities, though).

I googled this error and found that sometimes judicious use of "reindex" and "vacuum full" might solve this, but it didn't.

The corrupt database was renamed and put aside for study. We also saved the pg_xlog files from around the time the corruption occurred. Unfortunately, we do not have pg_basebackup-like backups (yet - also high on the wishlist now). We restored to a fresh database from pg_dump made 30 minutes before the corruption occurred, and were able to replay the missing transactions using detailed logging from the connecting system. We also made a pg_basebackup-copy of the system including the faulty database.

There are no IO errors, so this does not look like disk corruption. Also, this machine uses streaming replication to replicate to a hot standby slave, and the slave is corrupt in the exact same way (producing the exact same errors). That to me is more proof that this is not caused by a faulty disk, since database changes are (as far as I know) stored in the base/ directories, and at the same time streamed to streaming replication slaves (and written to pg_xlog), right?

This system has been in production since July, and has been running fine ever since. It could still be a hardware problem (memory corruption?), but if it is, it's quite rare.

As an example, this is what it looks like to try to query the broken database:
sim_stuk=# \d
                       List of relations
 Schema |                Name                |   Type   | Owner 
--------+------------------------------------+----------+-------
 public | _dbversioning                      | table    | admin
 public | _dbversioning_id_seq               | sequence | admin
 public | sim_instance                       | table    | admin
 public | sim_instance_id_seq                | sequence | admin
[... works fine ...]

sim_stuk=# \d sim_instance
ERROR:  missing chunk number 0 for toast value 4132453 in pg_toast_2619
sim_stuk=# select * from sim_instance limit 1;
ERROR:  missing chunk number 0 for toast value 4132461 in pg_toast_2619
sim_stuk=# select * from pg_attribute limit 1;
ERROR:  missing chunk number 0 for toast value 4132453 in pg_toast_2619

There are some tables we can still query, though, among these is a table containing "status" fields and timestamps, so we could learn approximately how many changes were missing from the backup that we restored.

Any suggestions on how to proceed? Thanks!

-- 
Jan-Pieter Cornet <johnpc@xxxxxxxxxx>
"Any sufficiently advanced incompetence is indistinguishable from malice."
    - Grey's Law

Attachment:
signature.asc

Description: OpenPGP digital signature