
Re: intermittent performance problem

Yeah, I wish I didn't have to resort to using ORDER BY RANDOM(). I really wanted to use something like TABLESAMPLE, but that is not implemented in PostgreSQL. Unfortunately, I cannot use the random sampling technique you mentioned, since I need to select samples across various strata of the data (in this case, where "item_name=something"), not just between timestamp ranges. Guess I'll just have to try kicking up the work_mem for that query.

Thanks so much for your input.


Mike

Scott Marlowe wrote:
On Mon, Mar 9, 2009 at 1:55 PM, Mike Charnoky <noky@xxxxxxxxxxx> wrote:
Hello,

I'm looking for some insight on an intermittent PostgreSQL performance
problem that has been very troublesome.  Using PG 8.3.5 on a server
running CentOS 5 2.6.18-8.el5 (Dual Xeon 2.00 GHz, 4 GB RAM, RAID-10
SCSI 600GB array).

The problem in a nutshell is this: on some days, a nightly sampling
process (which reads data from one very large table and writes to
another) runs about 2 orders of magnitude slower and disk IO goes
through the roof.  The sampling process (which starts at 1am and usually
takes ~30 minutes) takes many hours to complete and eventually ends up
interfering with other processes that need to access the database.
Other processes which need to write to the db get backed up and
eventually data gets dropped (ie: in memory queues waiting for db writes
get filled up).

The problem only happens on maybe one or two random days during the
week.  There is generally no other db activity during this time
(pg_stat_activity shows mostly idle connections).  It seems as if db
cache is not being used properly and heavy disk usage results.  Not sure
how to test this assumption.

Details are as follows:

1) The db contains a "raw_data" table which stores on the order of 10-15
million rows per day.  A total of two weeks of data are stored, total
table size is about 40GB (with indices).
2) Every day, a process runs which samples data from the "raw_data"
table and stores it to the "sampled_data" table for long term storage.
The sampling is done across two strata: time of day (time range) and
name of item (text field).  That is, the day is divided into about 5
chunks to ensure that we sample enough data for each chunk of time, for
each item.
3) The sampling process happens in two passes.  The first pass looks at
the data in order to determine the sample size required for each (time
of day, item name).  This consists of running some aggregate queries
over the entire dataset to be sampled (std dev, count, etc).  Sample
sizes are calculated for all items at once for a given chunk of time.
The second pass actually performs random sampling of the data and stores
the samples in the "sampled_data" table.  It is this second pass of the
sampling process that is running about 2 orders of magnitude slower!
4) After the sampling process finishes, the oldest day's worth of data
is deleted from the "raw_data" table.
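
The first pass described above can be sketched as follows. The finite-population sample-size formula and its margin/z parameters are assumptions for illustration only; the post does not say how the actual process computes sample sizes:

```python
import math

def sample_size(count, std_dev, margin=0.05, z=1.96):
    """Estimate rows to sample from one (time chunk, item name) stratum.

    count and std_dev would come from the first-pass aggregate queries
    (COUNT, STDDEV) over raw_data.  The margin of error and z-score are
    assumed parameters, not taken from the original post.
    """
    if count == 0 or std_dev == 0:
        return 0
    n0 = (z * std_dev / margin) ** 2               # infinite-population estimate
    return math.ceil(n0 / (1 + (n0 - 1) / count))  # finite-population correction

# e.g. a stratum with 2 million rows and std dev 0.4 needs only a few
# hundred samples, far fewer than the stratum itself
print(sample_size(2_000_000, 0.4))
```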

The sampling query which runs really slow on some days looks something
like this:

INSERT INTO sampled_data
 (item_name, timestmp, ... )
 SELECT item_name, timestmp, ...
 FROM raw_data
 WHERE timestmp >= ? AND timestmp < ?
 AND item_name = ?
 AND some_data_field IS NOT NULL
 ORDER BY random()
 LIMIT ?;

Have you got any other method for doing the sampling than order by
random()?  order by random() is the most inefficient way you could
possibly do this.  If you know the range of, say, ids:

select max(id), min(id) from rawtable where timestmp >= ? and timestmp < ?

Then use a random number generator to generate a list of ids between
those two ids, and select x rows from the database:

select * from rawtable where id in (...1000 ids here...);

That will be WAY faster than order by random().
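
The suggested technique can be sketched like this. The helper name is hypothetical, and it assumes a sequential integer id column on raw_data; ids missing from the range (e.g. deleted rows) will make the query return fewer rows than requested, so a caller may need to over-sample and retry:

```python
import random

def random_id_query(min_id, max_id, sample_size, seed=None):
    """Pick sample_size distinct ids uniformly from [min_id, max_id]
    and build a parameterized IN query for them (DB-API %s style)."""
    rng = random.Random(seed)
    ids = rng.sample(range(min_id, max_id + 1), sample_size)
    placeholders = ", ".join(["%s"] * len(ids))
    sql = f"SELECT * FROM raw_data WHERE id IN ({placeholders})"
    return sql, ids

sql, ids = random_id_query(1, 1_000_000, 1000, seed=42)
# cur.execute(sql, ids)  -- with a DB-API cursor such as psycopg2's
```

This replaces the full sort that ORDER BY random() forces with a handful of index lookups, which is why it avoids the heavy disk IO.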

Changes made to postgresql.conf include:
max_connections = 250
shared_buffers = 1024MB
work_mem = 32MB

If you are married to order by random() then you might wanna crank up
work_mem while running that query.  I'd try something in the 128MB to
512MB range to start with.
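
If cranking up work_mem, it can also be scoped to just the sampling transaction with SET LOCAL rather than raised globally in postgresql.conf; the 256MB value here is only an example from the suggested range, and the column list is elided as in the query above:

```sql
BEGIN;
SET LOCAL work_mem = '256MB';  -- reverts automatically at COMMIT/ROLLBACK
INSERT INTO sampled_data (item_name, timestmp, ...)
  SELECT item_name, timestmp, ...
  FROM raw_data
  WHERE ...;
COMMIT;
```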

Any pointers on how to troubleshoot this?

Try methods that don't involve order by random().


--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
