Database corruption

Waldo Nell <pwnell@xxxxxxxxxxxx> · Fri, 1 Sep 2006 17:02:24 +0200

We have recently upgraded from PostgreSQL 7.4.5 to 8.1.4.  Our DB is  
about 45GB in size and has about 100 tables, lots of stored  
procedures, triggers etc.

The DB lived on a SAN and is accessed via Fibre Channel cards in an  
IBM BladeCenter.  The system is running RedHat Linux Enterprise 3 I  
think. We connect to it using the latest JDBC driver from Java 1.5.

Here is the problem.  Yesterday night I saw one of my statements in  
my code threw this exception:

net.za.pwnconsulting.dblayer.query.DBQueryLayerException: Could not  
execute SQL query [select c.mc_ca_id as c_mc_ca_id, d.mc_ca_id as  
d_mc_ca_id, b.ca_id, b.ca_ctr_id from contract a, contract_agreement  
b left outer join monthend_contract c on c.mc_ca_id = ? and  
c.mc_last_proc_date = ? left outer join monthend2_contract d on  
d.mc_ca_id = ? and d.mc_last_proc_date = ? where a.ctr_id =  
b.ca_ctr_id and b.ca_agreement_type = ? and b.ca_start_date > ? and  
b.ca_start_date <= ? and a.ctr_id = ? and ca_level_company = ? and  
ca_level_data = ? and ca_level_type = 'V' and ca_id <> ? limit 1] in  
pool [Oasis] because: ERROR: out of memory
  Detail: Failed on request of size 20.
org.postgresql.util.PSQLException: ERROR: out of memory
  Detail: Failed on request of size 20.
	at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse 
(QueryExecutorImpl.java:1525)
	at org.postgresql.core.v3.QueryExecutorImpl.processResults 
(QueryExecutorImpl.java:1309)
	at org.postgresql.core.v3.QueryExecutorImpl.execute 
(QueryExecutorImpl.java:188)
	at org.postgresql.jdbc2.AbstractJdbc2Statement.execute 
(AbstractJdbc2Statement.java:452)
	at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags 
(AbstractJdbc2Statement.java:340)
	at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery 
(AbstractJdbc2Statement.java:239)
	at  
net.za.pwnconsulting.dblayer.query.StandardQueryLayer.queryDatabaseInter 
nal(StandardQueryLayer.java:289)
	at  
net.za.pwnconsulting.dblayer.query.StandardQueryLayer.queryDatabaseInter 
nalExceptionWrapper(StandardQueryLayer.java:158)
	at  
net.za.pwnconsulting.dblayer.query.StandardQueryLayer.queryDatabase 
(StandardQueryLayer.java:68)
	at  
net.za.pwnconsulting.dblayer.query.StandardQueryLayer.queryDatabase 
(StandardQueryLayer.java:58)
	at net.za.pwnconsulting.dblayer.AbstractDBLayer.queryDatabase 
(AbstractDBLayer.java:57)
	at  
za.co.massmart.oasis.daemons.ContractExtensionTimerTask.safeExecuteInter 
nal(ContractExtensionTimerTask.java:183)
	at za.co.massmart.oasis.daemons.ExtendedHACronTimerTask.safeExecute 
(ExtendedHACronTimerTask.java:94)
	at net.za.pwnconsulting.javaconfig.utils.time.HACronTimerTask.execute 
(HACronTimerTask.java:156)
	at net.za.pwnconsulting.javaconfig.utils.time.CronTimerTask.run 
(CronTimerTask.java:103)
	at java.lang.Thread.run(Thread.java:595)

So I checked and saw another application connecting to another DB in  
the same instance showed a very weird error.  For an "update  
tablename set field1 = X where primary_field = Y" statement I got a  
duplicate key exception on the serial field "primary_field" - which  
is impossible since I am not changing its value - I am updating a  
different column.  Vacuum full analyze did not fix this.  I dropped  
and recreated the index, then the error went away.  This was just  
after I restarted the postgres instance (I stopped it using pg_ctl  
stop -m immediate" since there were connections I could not close  
from other systems.

I restarted the application that encountered the first error above  
and ran the task again - it worked.  So I then started a vacuum full  
analyze on the main production DB.  It produced several warnings  
about page sizes not being large enough I think, but it completed  
successfully.

This morning we found some data generated this morning at 01:00 was  
present, but most of the tables lost all data captured since 25  
August 2006.  That amounts to about 500 contracts, a while billing  
cycle etc.  We upgraded postgresql on the 20th.

I checked the file system and it yielded no errors (ext3).  I checked  
the server and there were no obvious hardware issues.

How can part of the DB just be lost like that?  I have just verified  
by restoring a backup made yesterday night at 20:00 - before the  
errors I showed above happened - that DB also has the data missing.   
The bill run happened yesterday afternoon so somewhere between that  
something broke.  Any ideas on how to trouble shoot this?

To summarise - the DB lost about 6 days of transactions for no known  
reason.

Attachment:
smime.p7s

Description: S/MIME cryptographic signature