pg 8.1.3, AIX, huge box, painfully slow.

Gavin Hamill <gdh@xxxxxxxxxxxxx> · Fri, 7 Apr 2006 18:58:23 +0100

Bing-bong, passenger announcement.. the panic train is now pulling into
platform 8.1.3. Bing-bong. =)

OK, having moved from our quad-xeon to an 8-CPU IBM pSeries 650
(8x1.45GHz POWER4 instead of 4 x 3GHz Xeon), our query times have shot
up and our website is next to unusable. The IBM is not swapping (not
with 16GB of RAM!), disk i/o is low, but there must be something
critically wrong for this monster to be performing so badly..

There is little IO (maybe 500KB/sec), but the CPUs are often at 100%
usage.

VACUUM VERBOSE ANALYZE shows me 40000 page slots are needed to track
all free space. I have 160000 page slots configured, and this machine is
dedicated to pg.

The thing that really winds me up about this, is that aside from all
the normal 'my postgres is slow l0lz!' troubleshooting is the previous
machine (Debian sarge on four 3GHz Xeons) is using 8.1.3 also, with an
inferior I/O subsystem, and it churns through the workload very
merrily, only reaching a full loadavg of 4 at peak times, and running
our main 'hotelsearch' function in ~1000ms.. 

This IBM on the other hand is often taking 5-10 seconds to do the same
thing - although just by watching the logs it's clear to see the
workload coming in waves, and then calming down again. (this
correlation is matched by watching the load-balancer's logs as it takes
unresponsive webservers out of the cluster)

Here's the differences (I've removed obvious things like file/socket
paths) in "select name,setting from pg_catalog.pg_settings" between the
two:

--- cayenne     2006-04-07 18:43:48.000000000 +0100 # quad xeon
+++ jalapeno    2006-04-07 18:44:08.000000000 +0100 # ibm 650
- effective_cache_size            | 320000
+ effective_cache_size            | 640000
- integer_datetimes               | on
+ integer_datetimes               | off
- maintenance_work_mem            | 262144
+ maintenance_work_mem            | 1048576
- max_connections                 | 150
+ max_connections                 | 100
- max_fsm_pages                   | 66000
+ max_fsm_pages                   | 160000
- max_stack_depth                 | 2048
+ max_stack_depth                 | 16384
- tcp_keepalives_count            | 0
- tcp_keepalives_idle             | 0
- tcp_keepalives_interval         | 0
- temp_buffers                    | 1000
- TimeZone                        | GB
+ tcp_keepalives_count            | 8
+ tcp_keepalives_idle             | 7200
+ tcp_keepalives_interval         | 75
+ temp_buffers                    | 4000
+ TimeZone                        | GMT0BST,M3.5.0,M10.5.0
- wal_sync_method                 | fdatasync
- work_mem                        | 4096
+ wal_sync_method                 | open_datasync
+ work_mem                        | 16384

So, jalapeno really should have much more room to move. shared_buffers
is 60000 on both machines.

I'm reaching the end of my tether here - our search functions are just
so extensive and my pg knowledge is so small that it's overwhelming to
try and step through it to find any bottlenecks :(

Just to reiterate, it all runs great on cayenne since we trimmed a lot
of the fat out of the search, and I can't understand why the IBM box
isn't absolutely throwing queries out the door :)

Cheers,
Gavin.