Good morning, list.
We've got a bit of a problem on a customer's production box. This week we got a "missing chunk number 0 for toast value N" error (N being a number) there. We verified that the problem was limited to a single row, tried to fix it with updates, and ended up deleting the row.
To check for similar issues in other tables, we set up a script to run at midnight and do a pg_dump on each individual table in the database where the original error happened, sending stderr to a log file. Since the original problem was discovered while running pg_dump, we figured this would show us any tables that have similar issues.
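For anyone who wants to run the same kind of check, here is a minimal sketch of the idea in Python. It is not our actual script: the function names, the injectable runner argument, and the log layout (copied from the excerpt further down) are assumptions made for illustration. It only relies on pg_dump's --table option and on the standard subprocess module.

```python
import subprocess
from datetime import datetime, timezone

# Timestamp layout chosen to resemble the log excerpt in this post (an assumption).
STAMP_FORMAT = "%a %b %d %H:%M:%S UTC %Y"


def build_dump_command(dbname, table):
    """Return the argv for dumping a single table with pg_dump."""
    return ["pg_dump", "--table", table, dbname]


def run_dump(dbname, table, runner=subprocess.run):
    """Dump one table, discard the data, and return (succeeded, stderr_text).

    `runner` is injectable so the function can be exercised without a
    database; by default it is subprocess.run.
    """
    result = runner(
        build_dump_command(dbname, table),
        stdout=subprocess.DEVNULL,  # we only care whether the dump fails
        stderr=subprocess.PIPE,
        text=True,
    )
    return result.returncode == 0, result.stderr


def format_log_entry(table, started, ended, stderr_text):
    """Wrap pg_dump's stderr in start/end marker lines for the log file."""
    body = stderr_text
    if body and not body.endswith("\n"):
        body += "\n"
    return (
        f"--| Table {table} dump start: {started.strftime(STAMP_FORMAT)} |--\n"
        f"{body}"
        f"--| Table {table} dump end: {ended.strftime(STAMP_FORMAT)} |--\n"
    )


def check_tables(dbname, tables, log_path):
    """Dump each table in turn and append the results to log_path.

    Returns the list of tables whose dump failed.
    """
    failed = []
    with open(log_path, "a") as log:
        for table in tables:
            started = datetime.now(timezone.utc)
            ok, stderr_text = run_dump(dbname, table)
            ended = datetime.now(timezone.utc)
            log.write(format_log_entry(table, started, ended, stderr_text))
            if not ok:
                failed.append(table)
    return failed
```

Something like check_tables("customer_db", ["schema.table"], "/var/log/dump_check.log") could then be run from cron at midnight; the database name, table list, and log path there are placeholders.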
We found the same problem in a couple of other tables, but the bigger concern is that the table we had just fixed showed the error again, this time in a different row.
Some information on the customer's box: it's an Amazon EC2 instance running Debian (I believe Debian 5, but I'm not sure). They are using PostgreSQL 8.3.11, installed from apt. Their application(s) are mainly written in Ruby on Rails.
Here's the full error from the log file, minus (mildly) sensitive info:
--| Table schema.table dump start: Wed Aug 18 04:54:34 UTC 2010 |--
pg_dump: SQL command failed
pg_dump: Error message from server: ERROR: missing chunk number 0 for toast value N in pg_toast_M
pg_dump: The command was: COPY schema.table (id, foreign_key, some_text_stuff, timestamp1, timestamp2) TO stdout;
--| Table schema.table dump end: Wed Aug 18 04:54:44 UTC 2010 |--
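To pull the affected tables out of a log in this layout, something along these lines might work. This is a rough sketch in Python that assumes the marker lines and error wording shown above; the function name and the returned dictionary keys are made up for illustration, and in a real log N and M would be numbers.

```python
import re

# Matches the "dump start" marker line and captures the table name.
START_RE = re.compile(r"^--\| Table (\S+) dump start:")

# Matches the TOAST error; the " in pg_toast_M" suffix is treated as optional.
TOAST_RE = re.compile(
    r"missing chunk number (\d+) for toast value (\d+)(?: in (pg_toast_\d+))?"
)


def find_toast_errors(log_text):
    """Return one dict per 'missing chunk' error found in the log text."""
    current_table = None
    hits = []
    for line in log_text.splitlines():
        start = START_RE.match(line)
        if start:
            # Remember which table the following error lines belong to.
            current_table = start.group(1)
            continue
        error = TOAST_RE.search(line)
        if error:
            hits.append(
                {
                    "table": current_table,
                    "chunk": int(error.group(1)),
                    "toast_value": int(error.group(2)),
                    "toast_relation": error.group(3),
                }
            )
    return hits
```

Collecting the table names from the result on successive nights would show whether the same tables keep turning up.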
So the question is, what could be causing this? Finding that error in their database once would not be such a big deal, but it happened again right after we fixed it. Could it be Ruby? The customer's application(s)? Some weirdness with Amazon EC2 and/or Debian? A bug in PostgreSQL itself? Any ideas?
-Sam