Hi all, like always, time is short and we have a very critical situation here, so please help to save the day once again. Installation facts: - 4 High Availability Clusters consisting of 2 Dual Xeon Server machines (Dell PowerEdge 2850) using Heartbeat1. Each server has only one harddisk, no RAID configuration is used. - Each Cluster has 4 drbd Devices, one for the PostgreSQL data - Two of these clusters are using the same PostgreSQL installation to share the data, the database can be moved from one cluster to the other in case of failure - OS: Debian Sarge with postgresql-7.4.7-6sarge1 - Two cronjobs are configured to perform a "vacuumdb --all --full" every 10 minutes and a "vacuumdb --all --full --analyze" every 23 minutes - There are 3 applications using the PostgreSQL installation, each with their own database. - The main application is based on Tomcat 4.1-30 partly a Web-Application, partly a terminal login protocol based on http and XML, so the database access is of course done using JDBC - A cronjob is configured to perform a pg_dump of the main database every 4 hours Usage: - There are 3 tables which are heavily used by inserting and deleting some hundred entries per second, the others are rarely used by the Web-Application. One of these 3 table is the heaviest used one, also with locking mechanisms. - There are some pgsql functions which use locking mechanisms (select for update) - When a user logs in into the Web-Application, about 20 tables are created which contain the data used for the Web-Application. The last 10 versions of these 20 tables -> about 200 tables are used as old versions which can be used to restore previous configuration data (pretty bad solution). These tables were never used after their creation, just the oldest one is deleted if there are >10 versions - Unfortunately we are not using Database connection pooling - There are some threads which have a permanent database connection and perform some actions either on a regular basis or on demand. Problems: 1. The first problem we faced was that the pg_dump which is performed every 4 hours began to fail about 1-2 times a week because of the following error: pg_dump: attempt to lock table "tableblabla" failed: ERROR: catalog is missing 5 attribute(s) for relid 44798701 The table for which the errors occur is always one of these 200 unused tables containing previous configuration data. The recovery procedure always consists of deleting all entries with relname tableblablabla from the pg_class, dumping and restoring the database. 2. Happens about once a month: Starting with select and delete queries regarding the heaviest used database table, the database starts to not respond to different queries which leads to a unusable system. If the database is in such a state, it is not even possible to connect to the database using psql on the commandline. The recovery procedure was to just restart the PostgreSQL service. 3. There is a table which contains some configuration values which are never changed. A thread of the application has a permanent database connection reads a value from this table every 10 seconds and then performs some actions. Sometimes, the value cannot be read from this table, which means that the same SELECT statement executed every 10 seconds delivers no result, although the data is never modified. The table is never subject to locking or anything like that. After some queries failed, suddenly the correct value is returned again. 4. Once, the following error prevents creation of some tables (postgresql.log): 2006-08-30 17:50:39 [11189] ERROR: 23505: duplicate key violates unique constraint "pg_class_oid_index" followed by 2006-08-30 18:00:11 [18296] ERROR: XX000: catalog is missing 10 attribute(s) for relid 17586461 for every run of the vacuum cronjob. This errors occurred together with the freeze described in 3. Are there any known bugs of the used version of PostgreSQL, Tomcat, JDBC which could cause such problems? All these problems occur independently on both 2 cluster systems, so i think there is no hardware error. If you have any suggestions or ideas for debugging or what the reason for these problems could be, please reply as soon as possible. br, Peter