Re: [HACKERS] ERROR: could not read block

"Kevin Grittner" <Kevin.Grittner@xxxxxxxxxxxx> · Thu, 17 Nov 2005 11:39:20 -0600

1) We run a couple Java applications on the same box to provide
middle tier access.  When the box is heavily loaded, I think I've
seen about 80% PostgreSQL, 20% Java load.

2) I checked that no antivirus software was running, and had the
techs pare down the services running on that box to the absolute
minimum after the second failure, so that we could eliminate such
issues as possible causes.

3) The aforementioned Java apps hold open 21 database
connections.  (One for a software publisher to query a list of jar
files for access to the database, and 20 for a connection pool in
the middle tier.)  The way the pool is configured, six of those are
used for queries of normal priority, so we rarely have more than
six connections doing anything at any one moment.  During the
initial failure, the middle tier was under normal load, so 45,000
inserts were made to the table in question during the update.
After we hit the problem, we removed that middle tier from the
list of targets, so it was running, but totally idle during the
remaining tests.

None of this seems material, however.  It's pretty clear that the
problem was exhaustion of the Windows page pool.  Our Windows
experts have reconfigured the machine (which had been tuned
for Sybase ASE).  Their changes have boosted the page pool
from 20,000 entries to 180,000 entries.  We're continuing to test
to ensure that the problem is not showing up with this
configuration; but, so far, it looks good.

If we don't want to tell Windows users to make highly technical
changes to the Windows registry in order to use PostgreSQL,
it does seem wise to use retries, as has already been discussed
on this thread.

-Kevin

>>> "Magnus Hagander" <mha@xxxxxxxxxxxxxx>  >>>
[copying this one over to hackers]

> Our DBAs reviewed the Microsoft documentation you referenced, 
> modified the registry, and rebooted the OS.  We've been 
> beating up on the database without seeing the error so far.  
> We'll keep at it for a while.

Very interesting. As this seems to be a resource error, a couple of
questions. Sorry if you've already answered some of them; I couldn't
find the answers in the archives.

1) Is this a dedicated pg server, or does it have something else on it?

2) We have to ask this - do you run any antivirus on it that might not
be releasing resources the right way? Anything else that might stick in
a kernel driver?

3) Are you hitting the database with many connections, or is this a
single/few connection scenario? Are the other connections typically
active when this shows up?

Seems like we could just retry when we get this failure. The question is
whether we need to sleep for a short time before we do. Also, we can't
just retry forever; there has to be some kind of end to it...
(The SQL KB article can be read as saying that retrying is the correct
thing to do, because the bug in SQL was that it didn't retry.)
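
The retry idea discussed above (a short sleep between attempts and a
hard limit on the number of attempts) could be sketched roughly as
follows. This is only an illustration in Python, not PostgreSQL code:
TransientReadError, read_with_retry, and the default values of
max_attempts and delay_seconds are all hypothetical names and numbers
chosen for the example, and nothing in this thread settles what the
real limits should be.

```python
import time


class TransientReadError(OSError):
    """Hypothetical stand-in for a read that failed because the OS
    was temporarily out of resources."""


def read_with_retry(read_block, max_attempts=5, delay_seconds=0.1,
                    sleep=time.sleep):
    """Call read_block(); on a transient failure, sleep and try again.

    Gives up and re-raises after max_attempts failed calls, so the
    loop can never spin forever. The defaults are assumptions.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return read_block()
        except TransientReadError:
            if attempt == max_attempts:
                raise  # out of attempts: report the original error
            sleep(delay_seconds)  # brief pause before the next try
```

Only the error class assumed to be transient is retried; any other
exception propagates immediately, so a genuine I/O fault is not hidden
behind the retry loop.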

//Magnus

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster
