Postgres gets stuck

"Craig A. James" <cjames@xxxxxxxxxxxxxxxx> · Tue, 09 May 2006 17:38:17 -0700

I'm having a rare but deadly problem.  On our web servers, a Postgres backend process occasionally gets stuck and can't be unstuck.  Once it's stuck, all Postgres activity ceases.  Only "kill -9" will kill it -- signals 2 (SIGINT) and 15 (SIGTERM) have no effect, and "/etc/init.d/postgresql stop" fails.

Here's what the process table looks like:

$ ps -ef | grep postgres
postgres 30713     1  0 Apr24 ?        00:02:43 /usr/local/pgsql/bin/postmaster -p 5432 -D /disk3/postgres/data
postgres 25423 30713  0 May08 ?        00:03:34 postgres: writer process
postgres 25424 30713  0 May08 ?        00:00:02 postgres: stats buffer process
postgres 25425 25424  0 May08 ?        00:00:02 postgres: stats collector process
postgres 11918 30713 21 07:37 ?        02:00:27 postgres: production webuser 127.0.0.1(21772) SELECT
postgres 31624 30713  0 16:11 ?        00:00:00 postgres: production webuser [local] idle
postgres 31771 30713  0 16:12 ?        00:00:00 postgres: production webuser 127.0.0.1(12422) idle
postgres 31772 30713  0 16:12 ?        00:00:00 postgres: production webuser 127.0.0.1(12421) idle
postgres 31773 30713  0 16:12 ?        00:00:00 postgres: production webuser 127.0.0.1(12424) idle
postgres 31774 30713  0 16:12 ?        00:00:00 postgres: production webuser 127.0.0.1(12425) idle
postgres 31775 30713  0 16:12 ?        00:00:00 postgres: production webuser 127.0.0.1(12426) idle
postgres 31776 30713  0 16:12 ?        00:00:00 postgres: production webuser 127.0.0.1(12427) idle
postgres 31777 30713  0 16:12 ?        00:00:00 postgres: production webuser 127.0.0.1(12428) idle
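
A listing like the one above can be filtered mechanically for backends that are not idle. The following is only a rough sketch: it assumes the column layout of "ps -ef" shown here, and the function name busy_backends is made up for illustration.

```python
import re

# Matches the command column of a client backend as ps shows it, e.g.
# "postgres: production webuser 127.0.0.1(21772) SELECT".
# Helper processes ("writer process", "stats collector process") do not match.
BACKEND_RE = re.compile(
    r"postgres: (?P<db>\S+) (?P<user>\S+) (?P<client>\S+) (?P<activity>.+)$"
)


def busy_backends(ps_lines):
    """Return (pid, cpu_time, activity) for every backend that is not idle."""
    found = []
    for line in ps_lines:
        # Assumed columns: UID PID PPID C STIME TTY TIME CMD
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        match = BACKEND_RE.match(fields[7])
        if match and match.group("activity") != "idle":
            found.append((int(fields[1]), fields[6], match.group("activity")))
    return found


sample = [
    "postgres 25423 30713  0 May08 ?        00:03:34 postgres: writer process",
    "postgres 11918 30713 21 07:37 ?        02:00:27 postgres: production webuser 127.0.0.1(21772) SELECT",
    "postgres 31624 30713  0 16:11 ?        00:00:00 postgres: production webuser [local] idle",
]
print(busy_backends(sample))  # [(11918, '02:00:27', 'SELECT')]
```

Note that a backend reported as "idle in transaction" would also be returned, since only the exact activity "idle" is skipped.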

The SELECT process is the one that's stuck.  top(1) and other indicators show that nothing is going on at all (no CPU usage, normal memory usage); the process seems to be blocked waiting for something.  (The "idle" processes are attached to a FastCGI program.)
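
To see what a blocked process is actually waiting on, Linux exposes the scheduler state and the kernel wait channel under /proc. The sketch below is a diagnostic aid under stated assumptions, not a fix: it assumes a Linux /proc filesystem with the "status" and "wchan" files, and the helper names parse_status and describe are hypothetical.

```python
from pathlib import Path


def parse_status(text):
    """Turn the 'Key:<tab>value' lines of /proc/<pid>/status into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def describe(pid, proc_root="/proc"):
    """Report the name, scheduler state, and wait channel of one process."""
    base = Path(proc_root) / str(pid)
    status = parse_status((base / "status").read_text())
    try:
        # Kernel symbol the process is sleeping in, if the kernel exposes it
        wchan = (base / "wchan").read_text().strip() or "?"
    except OSError:
        wchan = "?"
    return {
        "name": status.get("Name"),
        "state": status.get("State"),
        "wchan": wchan,
    }
```

Calling describe(11918) as the postgres user or root would return something like a "State" of "S (sleeping)" or "D (disk sleep)" together with a wait channel. A process parked in a semaphore or socket wait points toward a lock or network problem, while "D" points toward I/O.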

This has happened on *two different machines*, both doing completely different tasks.  The first one is essentially a read-only warehouse that serves lots of queries, and the second one is the server we use to load the warehouse.  In both cases, Postgres had been running for a long time and was executing SELECT statements that it has executed millions of times before with no problems.  No other processes are accessing Postgres, just the web services.

This is a deadly bug, because our web site goes dead when this happens, and it requires an administrator to log in, kill the stuck postgres process, and then restart Postgres.  We've installed a failover system so that the web site is diverted to a backup server, but since this has happened twice in one week, we're worried.
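
One way a failover setup could notice this condition automatically is to run a trivial query with a hard timeout. This is a minimal sketch, not a description of the failover system mentioned above: the probe function and the PSQL_CHECK command line are assumptions, with the port, user, and database names taken from the process listing.

```python
import subprocess


def probe(command, timeout=5.0):
    """Return True if command exits with status 0 within timeout seconds."""
    try:
        result = subprocess.run(
            command,
            stdin=subprocess.DEVNULL,   # never block on a password prompt
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,            # the child is killed when this expires
        )
    except (subprocess.TimeoutExpired, OSError):
        # Hung server, or the client binary could not be started
        return False
    return result.returncode == 0


# Hypothetical health check; adjust port, user, and database as needed.
PSQL_CHECK = ["psql", "-p", "5432", "-U", "webuser", "-d", "production",
              "-At", "-c", "SELECT 1"]

if __name__ == "__main__":
    print("healthy" if probe(PSQL_CHECK) else "unresponsive")
```

Run from cron or a monitoring loop, a False result could trigger the diversion to the backup server. The timeout matters because a hung server may accept the connection and then never answer.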

Any ideas?

Details:

   Postgres 8.0.3
   Linux 2.6.12-1.1381_FC3smp i686 i386

   Dell 2-CPU Xeon system (hyperthreading is enabled)
   4 GB memory
   Two 120 GB disks (SATA on machine 1, IDE on machine 2)

Thanks,
Craig