On Mon, Mar 9, 2009 at 1:55 PM, Mike Charnoky <noky@xxxxxxxxxxx> wrote:
> Hello,
>
> I'm looking for some insight on an intermittent PostgreSQL performance
> problem that has been very troublesome. Using PG 8.3.5 on a server
> running CentOS 5 2.6.18-8.el5 (Dual Xeon 2.00 GHz, 4 GB RAM, RAID-10
> SCSI 600GB array).
>
> The problem in a nutshell is this: on some days, a nightly sampling
> process (which reads data from one very large table and writes to
> another) runs about 2 orders of magnitude slower and disk IO goes
> through the roof. The sampling process (which starts at 1am and usually
> takes ~30 minutes) takes many hours to complete and eventually ends up
> interfering with other processes that need to access the database.
> Other processes which need to write to the db get backed up and
> eventually data gets dropped (ie: in-memory queues waiting for db writes
> get filled up).
>
> The problem only happens on maybe one or two random days during the
> week. There is generally no other db activity during this time
> (pg_stat_activity shows mostly idle connections). It seems as if the db
> cache is not being used properly and heavy disk usage results. Not sure
> how to test this assumption.
>
> Details are as follows:
>
> 1) The db contains a "raw_data" table which stores on the order of 10-15
> million rows per day. A total of two weeks of data are stored, total
> table size is about 40GB (with indices).
> 2) Every day, a process runs which samples data from the "raw_data"
> table and stores it to the "sampled_data" table for long term storage.
> The sampling is done across two strata: time of day (time range) and
> name of item (text field). That is, the day is divided into about 5
> chunks to ensure that we sample enough data for each chunk of time, for
> each item.
> 3) The sampling process happens in two passes. The first pass looks at
> the data in order to determine the sample size required for each (time
> of day, item name). This consists of running some aggregate queries
> over the entire dataset to be sampled (std dev, count, etc). Sample
> sizes are calculated for all items at once for a given chunk of time.
> The second pass actually performs random sampling of the data and stores
> the samples in the "sampled_data" table. It is this second pass of the
> sampling process that is running about 2 orders of magnitude slower!
> 4) After the sampling process finishes, the oldest day's worth of data
> is deleted from the "raw_data" table.
>
> The sampling query which runs really slow on some days looks something
> like this:
>
> INSERT INTO sampled_data
>   (item_name, timestmp, ... )
> SELECT item_name, timestmp, ...
>   FROM raw_data
>  WHERE timestmp >= ? AND timestmp < ?
>    AND item_name = ?
>    AND some_data_field NOTNULL
>  ORDER BY random()
>  LIMIT ?;

Have you got any other method for doing the sampling than order by
random()?  order by random() is the most inefficient way you could
possibly do this.

If you know the range of, say, ids, you can run:

  select max(id), min(id) from rawtable
  where timestmp >= ? and timestmp < ?

to get it.  Then use a random number generator to generate a list of ids
between those two ids, and select x rows from the database:

  select * from rawtable where id in (1000 ids go here);

That will be WAY faster than order by random().
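For what it's worth, here is a rough pure-SQL sketch of that id-range
approach, building the random id list on the server side with
generate_series().  It assumes raw_data has an integer primary key named
"id", which the original post doesn't mention, and :min_id, :max_id and
:sample_size are placeholders you fill in from the first query:

  -- find the id range for one (time chunk, item) stratum
  SELECT min(id) AS min_id, max(id) AS max_id
    FROM raw_data
   WHERE timestmp >= ? AND timestmp < ?
     AND item_name = ?;

  -- fetch rows by randomly generated ids within that range;
  -- random() can repeat values and can land on ids that no longer
  -- exist (old rows get deleted daily), so over-generate a bit or
  -- re-run until you have enough samples
  SELECT *
    FROM raw_data
   WHERE id IN (
         SELECT (:min_id + floor(random() * (:max_id - :min_id + 1)))::int
           FROM generate_series(1, :sample_size)
         );

Since this reaches the rows through the primary key index instead of
sorting the whole time slice, the cost scales with the sample size rather
than with the number of rows in the chunk.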
> Changes made to postgresql.conf include:
> max_connections = 250
> shared_buffers = 1024MB
> work_mem = 32MB

If you are married to order by random(), then you might want to crank up
work_mem while running that query.  I'd try something in the 128MB to
512MB range to start with.

> Any pointers on how to troubleshoot this?

Try methods that don't involve order by random().
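If you do stick with order by random() and raise work_mem, note that you
can bump it for just the sampling session rather than globally in
postgresql.conf (a sketch; the right value depends on how much RAM is
actually free on the box):

  -- in the session that runs the nightly sampling query
  SET work_mem = '256MB';
  -- ... run the INSERT ... SELECT ... ORDER BY random() here ...
  RESET work_mem;

That way the 250 allowed connections can't each grab hundreds of
megabytes at once.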