> What do your various logs (pgsql, application, etc...) have to say? There is hardly anything helpful in the pgsql log. The application log doesn't mention anything either. We log a great deal of information in our application, but there's nothing out of the ordinary there, although there's of course always a chance that somewhere we missed something. > Can you read a java stack trace? Sometimes slogging through them will > reveal some useful information. I can read a java stack trace very well, I'm primarily a Java developer. The stack trace is the following one: org.postgresql.util.PSQLException: An I/O error occured while sending to the backend. at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:218) at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:451) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:350) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:304) at org.apache.tomcat.dbcp.dbcp.DelegatingPreparedStatement.executeUpdate(DelegatingPreparedStatement.java:102) at org.apache.tomcat.dbcp.dbcp.DelegatingPreparedStatement.executeUpdate(DelegatingPreparedStatement.java:102) Our code simply executes the same statement.executeUpdate() that is does about 500.000 times during a business day. As soon as this select query is hitting this 22 GB table then there's a chance that 'suddenly' all these utterly simply insert queries start failing. The insert query is nothing special either. It's just an "INSERT INTO ... VALUES (...)" type of thing. The select query can actually be a few different kinds of queries, but basically the common thing between them is reading from this 22 GB table. In fact, our system administrator just told me that even the DB backup is able to trigger this behaviour. As soon as the backup process is reading from this 22 GB table, the inserts on it -may- start to fail. > Sounds to me like your connections are timing out (what's your timeout > in jdbc set to?) There's no explicit timeout being set. Queries can theoretically execute for hours. In some rare cases, some queries indeed run for that long. > A likely cause is that you're getting big checkpoint spikes. What > does vmstat 10 say during these spikes? It's hard to reproduce the problem. We're trying to simulate it in on our testing servers but haven't been successfull yet. The problem typically lasts for only a minute a time on the production server and there's no saying on when it will occur again. Of course we could try to enfore it by running this select query continously, but for a production server it's not an easy decission to actually do that. So therefore basically all we were able to do now is investigate the logs afterwards. I'll try to run vmstat though when the problem happens when I'm at the console. > If you're running the > sysstate service with data collection then sar can tell you a lot. Ok, I'm not a big PG expert so I'll have to look into what that means exactly ;) Thanks for the advice though. Kind regards, Henk Express yourself instantly with MSN Messenger! MSN Messenger |