Hello, I'm looking for some insight into an intermittent PostgreSQL performance problem that has been very troublesome. We're running PG 8.3.5 on a CentOS 5 server (kernel 2.6.18-8.el5, dual Xeon 2.00 GHz, 4 GB RAM, 600 GB RAID-10 SCSI array).

The problem in a nutshell: on some days, a nightly sampling process (which reads data from one very large table and writes to another) runs about two orders of magnitude slower than usual, and disk IO goes through the roof. The sampling process (which starts at 1am and usually takes ~30 minutes) takes many hours to complete and eventually interferes with other processes that need to access the database. Processes that need to write to the db get backed up, and eventually data gets dropped (i.e., in-memory queues waiting for db writes fill up). The problem happens on only one or two seemingly random days per week, and there is generally no other db activity during this time (pg_stat_activity shows mostly idle connections). It seems as if the db cache is not being used properly and heavy disk usage results, but I'm not sure how to test this assumption (I've sketched what I plan to try in the P.S. below).

Details are as follows:

1) The db contains a "raw_data" table which receives on the order of 10-15 million rows per day. Two weeks of data are kept, for a total table size of about 40 GB (including indexes).

2) Every day, a process samples data from the "raw_data" table and stores it in the "sampled_data" table for long-term storage. The sampling is stratified across two dimensions: time of day (time range) and item name (a text field). That is, the day is divided into about 5 chunks to ensure that we sample enough data for each chunk of time, for each item.

3) The sampling process happens in two passes. The first pass examines the data to determine the sample size required for each (time of day, item name) pair; this consists of running aggregate queries (stddev, count, etc.) over the entire dataset to be sampled, with sample sizes calculated for all items at once for a given chunk of time. The second pass performs the actual random sampling of the data and stores the samples in the "sampled_data" table. It is this second pass that runs about two orders of magnitude slower on the bad days!

4) After the sampling process finishes, the oldest day's worth of data is deleted from the "raw_data" table.

The sampling query which runs really slowly on some days looks something like this:

INSERT INTO sampled_data (item_name, timestmp, ...)
SELECT item_name, timestmp, ...
FROM raw_data
WHERE timestmp >= ? AND timestmp < ?
  AND item_name = ?
  AND some_data_field IS NOT NULL
ORDER BY random()
LIMIT ?;

We have done a great deal of PG tuning, including autovacuum settings for the "raw_data" table. Autovacuum kicks in like clockwork on that table every day after the sampling process finishes (deleting one day's worth of data from "raw_data" is roughly a 7% change in table size). Changes made to postgresql.conf include:

max_connections = 250
shared_buffers = 1024MB
work_mem = 32MB
maintenance_work_mem = 256MB
max_fsm_pages = 10000000
max_fsm_relations = 30000
checkpoint_segments = 64
checkpoint_timeout = 10min
checkpoint_warning = 1min

Any pointers on how to troubleshoot this?

Mike
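
P.S. For the cache assumption, what I was planning to try is snapshotting the block hit/read counters for the tables involved before and after the nightly run, and comparing good nights to bad ones. A rough sketch (table names are from our schema; the counters in pg_statio_user_tables are cumulative since the last stats reset, so it's the before/after deltas that matter):

SELECT relname,
       heap_blks_read, heap_blks_hit,
       idx_blks_read, idx_blks_hit,
       -- fraction of heap block requests served from shared_buffers
       round(heap_blks_hit::numeric
             / nullif(heap_blks_hit + heap_blks_read, 0), 4) AS heap_hit_ratio
FROM pg_statio_user_tables
WHERE relname IN ('raw_data', 'sampled_data');

If the hit ratio collapses on the bad nights, that would seem to confirm that the working set is falling out of cache.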
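
I also intend to capture the plan for the SELECT half of the slow query on both a good night and a bad night, to see whether the plan itself changes. Something along these lines (the literal timestamps, item name, and limit below are made-up placeholders for our real parameters):

EXPLAIN ANALYZE
SELECT item_name, timestmp
FROM raw_data
WHERE timestmp >= '2009-01-06 00:00' AND timestmp < '2009-01-06 04:48'
  AND item_name = 'example_item'       -- placeholder item
  AND some_data_field IS NOT NULL
ORDER BY random()
LIMIT 1000;                            -- placeholder sample size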
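
Finally, I was thinking of enabling some extra logging so the bad nights leave a trace in the server log. As far as I can tell, all of these settings exist in 8.3:

log_checkpoints = on               # log timing and buffers written for each checkpoint
log_min_duration_statement = 1000  # log any statement running longer than 1s
log_lock_waits = on                # log queries that wait longer than deadlock_timeout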