locking issue on simple selects?

Tobias Brox <tobixen@xxxxxxxxx> · Wed, 15 Sep 2010 12:05:03 +0200

We have a production database server ... it's quite busy but usually
working completely fine, simple queries taking a fraction of a
millisecond to run.

Recently we've frequently encountered issues where some simple selects
(meaning, selects doing an index lookup and fetching one row) have
become stuck for several minutes.  Apparently all requests on one
exact table gets stuck, all requests not related to said table are
going through without any problems.  According to the pg_stat_activity
view, all queries getting stuck was read-queries (selects), no updates
or anything like that (some of the transactions were doing updates
and/or inserts though).

The obvious thought seems to be that this is a locking issue ... but
it doesn't seem so.  For one thing, AFAIK locking shouldn't affect
selects, only updates?  I've also looked through tons of logs without
finding any obvious locking issues.  In one of the instances, I could
find that there were some open transactions doing updates on one row
in the table and then later becoming stuck (as all other transactions)
when doing a select on another row in the said table.

My second thought was that the database is on the edge of being
overloaded and that the memory situation is also just on the edge ...
important indexes that used to be in memory now has to be accessed
from the disk.  Still, it doesn't make sense, we're not seeing any
serious impact on the CPU iowait status, and it seems improbable that
it should take minutes to load an index?

There aren't any long-lasting transactions going on when the jam
occurs.  I haven't checked much up, usually the jam seems to resolve
itself pretty instantly, but I think that at some point it took half a
minute from the first query was finished until the pg_stat_activity
view got back to normal (meaning typically 0-5 simultaneously
processed queries).

FWIW, we're running pg 8.3.11, transaction isolation level
serializable.  The machine is quad-core hyperthreaded (16 CPUs visible
to the OS), a SAN is used for storage, and different RAIDs are used
for the root partition, pg data and pg wals.

Any ideas?  I'm aware that some configuration (i.e. checkpoint
interval etc) may cause significant delay on write-queries ... but
this is only read-queries.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance