The DB lived on a SAN and is accessed via Fibre Channel cards in an IBM BladeCenter. The system is running RedHat Linux Enterprise 3 I think. We connect to it using the latest JDBC driver from Java 1.5.
Here is the problem. Yesterday night I saw one of my statements in my code threw this exception:
net.za.pwnconsulting.dblayer.query.DBQueryLayerException: Could not execute SQL query [select c.mc_ca_id as c_mc_ca_id, d.mc_ca_id as d_mc_ca_id, b.ca_id, b.ca_ctr_id from contract a, contract_agreement b left outer join monthend_contract c on c.mc_ca_id = ? and c.mc_last_proc_date = ? left outer join monthend2_contract d on d.mc_ca_id = ? and d.mc_last_proc_date = ? where a.ctr_id = b.ca_ctr_id and b.ca_agreement_type = ? and b.ca_start_date > ? and b.ca_start_date <= ? and a.ctr_id = ? and ca_level_company = ? and ca_level_data = ? and ca_level_type = 'V' and ca_id <> ? limit 1] in pool [Oasis] because: ERROR: out of memory
Detail: Failed on request of size 20. org.postgresql.util.PSQLException: ERROR: out of memory Detail: Failed on request of size 20.at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse (QueryExecutorImpl.java:1525) at org.postgresql.core.v3.QueryExecutorImpl.processResults (QueryExecutorImpl.java:1309) at org.postgresql.core.v3.QueryExecutorImpl.execute (QueryExecutorImpl.java:188) at org.postgresql.jdbc2.AbstractJdbc2Statement.execute (AbstractJdbc2Statement.java:452) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags (AbstractJdbc2Statement.java:340) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery (AbstractJdbc2Statement.java:239) at net.za.pwnconsulting.dblayer.query.StandardQueryLayer.queryDatabaseInter nal(StandardQueryLayer.java:289) at net.za.pwnconsulting.dblayer.query.StandardQueryLayer.queryDatabaseInter nalExceptionWrapper(StandardQueryLayer.java:158) at net.za.pwnconsulting.dblayer.query.StandardQueryLayer.queryDatabase (StandardQueryLayer.java:68) at net.za.pwnconsulting.dblayer.query.StandardQueryLayer.queryDatabase (StandardQueryLayer.java:58) at net.za.pwnconsulting.dblayer.AbstractDBLayer.queryDatabase (AbstractDBLayer.java:57) at za.co.massmart.oasis.daemons.ContractExtensionTimerTask.safeExecuteInter nal(ContractExtensionTimerTask.java:183) at za.co.massmart.oasis.daemons.ExtendedHACronTimerTask.safeExecute (ExtendedHACronTimerTask.java:94) at net.za.pwnconsulting.javaconfig.utils.time.HACronTimerTask.execute (HACronTimerTask.java:156) at net.za.pwnconsulting.javaconfig.utils.time.CronTimerTask.run (CronTimerTask.java:103)
at java.lang.Thread.run(Thread.java:595)So I checked and saw another application connecting to another DB in the same instance showed a very weird error. For an "update tablename set field1 = X where primary_field = Y" statement I got a duplicate key exception on the serial field "primary_field" - which is impossible since I am not changing its value - I am updating a different column. Vacuum full analyze did not fix this. I dropped and recreated the index, then the error went away. This was just after I restarted the postgres instance (I stopped it using pg_ctl stop -m immediate" since there were connections I could not close from other systems.
I restarted the application that encountered the first error above and ran the task again - it worked. So I then started a vacuum full analyze on the main production DB. It produced several warnings about page sizes not being large enough I think, but it completed successfully.
This morning we found some data generated this morning at 01:00 was present, but most of the tables lost all data captured since 25 August 2006. That amounts to about 500 contracts, a while billing cycle etc. We upgraded postgresql on the 20th.
I checked the file system and it yielded no errors (ext3). I checked the server and there were no obvious hardware issues.
How can part of the DB just be lost like that? I have just verified by restoring a backup made yesterday night at 20:00 - before the errors I showed above happened - that DB also has the data missing. The bill run happened yesterday afternoon so somewhere between that something broke. Any ideas on how to trouble shoot this?
To summarise - the DB lost about 6 days of transactions for no known reason.
Attachment:
smime.p7s
Description: S/MIME cryptographic signature