Had some database corruption problems today. Since they came on the
heels of making some minor database changes yesterday, they may or may
not be related to that. Centos 4.x, Postgresql 8.1.4
I modified the following settings and then issued a reload. I hadn't
turned up the kernel.shmmax to allow for these bigger memory settings,
but the database continued to run fine.
shared_buffers = 10000
work_mem = 2048
autovacuum = on # enable autovacuum subprocess?
autovacuum_naptime = 60 # time between autovacuum runs,
in secs
autovacuum_vacuum_threshold = 1000 # min # of tuple updates before
autovacuum_analyze_threshold = 500 # min # of tuple updates before
autovacuum_vacuum_scale_factor = 0.4 # fraction of rel size before
autovacuum_analyze_scale_factor = 0.2 # fraction of rel size before
autovacuum_vacuum_cost_delay = -1 # default vacuum cost delay for
autovacuum_vacuum_cost_limit = -1 # default vacuum cost limit for
After reloading I ran a number of vacuumdb -a -z which completed
successfully.
Sometime after the vacuum or reload one of our clients started to have
database problems. The other databases in the same postgresql on the
server don't seem to be affected.
When I finally got the error report in the morning the database was in
this state:
$ psql dbname
dbname=# \dt
ERROR: cache lookup failed for relation 20884
Doing a select * from pg_tables seemed to indicate that some of the
tables were no longer in the database, also some other tables were
inaccessible.
I made a backup and then some functionality was restored by issuing a
reindex system dbname
Using the "broken database" pg_dump on all tables in pg_table gives this
for some tables:
pg_dump: SQL command failed
pg_dump: Error message from server: ERROR: cache lookup failed for
relation 20871
pg_dump: The command was: SELECT tableoid, oid, conname,
pg_catalog.pg_get_const
raintdef(oid) as condef FROM pg_catalog.pg_constraint WHERE conrelid =
'20876'::
pg_catalog.oid AND contype = 'f'
pg_dump: SQL command failed
pg_dump: Error message from server: ERROR: relation
"public.auth_groups_permissions" does not exist
pg_dump: The command was: LOCK TABLE public.auth_groups_permissions IN
ACCESS SH
ARE MODE
pg_dump: SQL command failed
pg_dump: Error message from server: ERROR: relation with OID 21186 does
not exist
pg_dump: The command was: LOCK TABLE public.ght_ght_shippingorders IN
ACCESS SHA
RE MODE
Our backups failed of course, and we have made a good attempt at
recovery, which we are willing to accept as all the recovery we can do.
As a last chance to get some of the data back, I would be interested if
there is any way to read through the raw database files to see if I can
recover some more data from them.
I'm also curious if any of the settings/reload caused this problem, or
perhaps the vacuum or autovacuum is what caused our error. I'll file a
bug report if it's somehow repeatable.
Oh, and make backups.
-Mike