Our production database is seeing very heavy CPU utilization - does
anyone have any ideas or input given the following?
CPU utilization gradually increases during the day until it approaches
90%-100% at our peak time. When this happens our transactions/sec
drops and our site becomes very slow. When in this state, I can see
hundreds of queries in pg_stat_activity that are not waiting on locks
but sit there for minutes. When the database is not in this state,
those same queries complete in fractions of a second - faster than
my script that watches pg_stat_activity can keep track of them.
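
For reference, the check my script runs is essentially the following
(a minimal sketch - the one-minute threshold is just an example, and
procpid/current_query/waiting are the 8.3 column names):

  -- long-running statements that are not blocked on a lock
  SELECT procpid, now() - query_start AS runtime, current_query
    FROM pg_stat_activity
   WHERE waiting IS FALSE
     AND current_query NOT LIKE '<IDLE>%'
     AND now() - query_start > interval '1 minute'
   ORDER BY runtime DESC;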
This server has dual quad-core Xeon 5310s, 32 GB RAM, and a few
different disk arrays (all managed in hardware by either the PERC 5/i
or PERC 5/e adapter). The Postgres data is on a 14-disk 7.2k RPM SATA
RAID 10. This server runs nothing but Postgres.
The PostgreSQL database (according to pg_database_size) is 55GB and we
are running PostgreSQL 8.3.5 and the 2.6.28.7-2 kernel under Arch Linux.
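
The size figure above comes from something along the lines of:

  SELECT pg_size_pretty(pg_database_size('mydb'));  -- 'mydb' is a placeholder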
Right now (not under peak load) this server is running at 68% CPU
utilization and its SATA RAID 10 is doing about 2 MB/s writes and
11 MB/s reads. When I run dd I can hit 200+ MB/s writes and 230+ MB/s
reads, so we are barely using the available IO. Further, when I run
dd, the CPU utilization of that process only approaches 20%-30% of
one core.
Additionally, when I watch "top -c" I generally see a dozen or so
"idle" postgres processes (they appear and drop away quickly, though)
consuming very large chunks of CPU - as much as 60% of a core. At any
given time we have hundreds of idle postgres processes due to the JDBC
connection pooling, but most of them sit at 0% CPU as I would expect.
I also see SELECTs and INSERTs consuming very large percentages of
CPU, but I find that less strange since they are actually doing work.
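
When I catch one of those hot PIDs in top, I can map it back to its
backend like so (12345 is a made-up PID, just to sketch the idea):

  -- look up the backend behind a CPU-hungry PID seen in top
  SELECT procpid, usename, waiting, current_query
    FROM pg_stat_activity
   WHERE procpid = 12345;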
Any ideas as to what is causing our CPUs to struggle? Could the fact
that our RAM covers a significant portion of the database be causing
our CPUs to thrash as they work through memory while our disk
controllers sit idle? According to top we barely use any swap.
We currently have max_connections set to 1000 (roughly the sum of the
JDBC pools on our application servers). Would decreasing this value
help? We can shrink the JDBC pools or switch to pgbouncer for pooling
if so.
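
For what it's worth, here is a rough way to see how those slots are
being used at the moment (a sketch; the '<IDLE>%' pattern also catches
idle-in-transaction backends):

  -- idle vs. active backends out of the max_connections slots
  SELECT CASE WHEN current_query LIKE '<IDLE>%' THEN 'idle'
              ELSE 'active' END AS state,
         count(*) AS backends
    FROM pg_stat_activity
   GROUP BY 1;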
Really just looking for any input/ideas. Our workload is primarily
OLTP in nature - essentially a social network. By transactions/sec at
the start I mean the xact_commit value in pg_stat_database; please let
me know if that value is not appropriate for estimating tps. Right
now, with the 60% CPU utilization and low IO use, xact_commit is
increasing at a rate of about 1070 per second.
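
That rate comes from sampling xact_commit twice and dividing the delta
by the interval, roughly like this ('mydb' is again a placeholder):

  -- sample this twice, N seconds apart; tps ~= (second - first) / N
  SELECT xact_commit
    FROM pg_stat_database
   WHERE datname = 'mydb';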
I have an identical PITR slave whose sync I can pause in order to run
any tests against it. I will happily provide any additional
information that would be helpful.
Any assistance is greatly appreciated.
Joe Uhl