On Fri, Jan 6, 2012 at 6:35 AM,  <antoine@xxxxxxxxx> wrote:
> Hello,
>
> I have a table with approximately 50 million rows and a schema like this:
>
>    id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
>    t_value integer NOT NULL DEFAULT 0,
>    t_record integer NOT NULL DEFAULT 0,
>    output_id integer NOT NULL DEFAULT 0,
>    count bigint NOT NULL DEFAULT 0,
>    CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)
>
> Every 5 minutes, a process has to insert a few thousand rows into this
> table, but sometimes it has to insert an already existing row (based on
> the values of the triplet (t_value, t_record, output_id)). In that case,
> the row must be updated with the new count value. I've tried some of the
> solutions given in this stackoverflow question [1], but the insertion
> rate is always too low for my needs.

What are your needs?  It should take no special hardware or coding to
be able to manage a few thousand rows over 5 minutes (one way to do it
is sketched at the end of this mail).

> So I've decided to do it in two steps:
>
> - I insert all my new data with a COPY command
> - When that's done, I run a delete query to remove the oldest duplicates
>
> Right now, the heart of my delete query looks like this:
>
>    SELECT min(id) FROM stats_5mn
>    GROUP BY t_value, t_record, output_id
>    HAVING count(*) > 1;
>
> The duration of the query on my test machine with approx. 16 million
> rows is ~18s.
>
> To reduce this duration, I've tried to add an index on my triplet:
>
>    CREATE INDEX test
>    ON stats_5mn
>    USING btree
>    (t_value, t_record, output_id);
>
> By default, the PostgreSQL planner doesn't want to use my index and does
> a sequential scan [2], but if I force it with "SET enable_seqscan = off",
> the index is used [3] and the query duration drops to ~5s.
>
> My questions:
>
> - Why does the planner refuse to use my index?

It thinks that using the index will be about 9 times more expensive
than the full scan.  Probably your settings for seq_page_cost and
random_page_cost are such that the planner thinks that nearly every
buffer read is going to be from disk.  But in reality (in this case)
your data is all in memory, so the planner is mis-estimating.  (It
would help to verify this if you did your EXPLAIN ANALYZE with BUFFERS
as well; see the example at the end of this mail.)

But before trying to fix this by tweaking settings, will the real case
always be like your test case?  If the data stops being all in memory,
either because the problem size increases or because you have to
compete for buffer space with other things going on, then using the
index scan could be catastrophic.

Cheers,

Jeff
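P.S. A few sketches that might help, each under stated assumptions.

Since there is no built-in upsert available here, one common pattern
is to COPY into a staging table and then update/insert from it in a
single transaction, instead of COPY-then-dedup.  This is only a
sketch: the staging table name and the file path are made up, and it
assumes a single loader process with no concurrent writers:

    BEGIN;

    -- Hypothetical staging table; holds one 5-minute batch.
    CREATE TEMP TABLE stats_5mn_incoming (
        t_value   integer NOT NULL,
        t_record  integer NOT NULL,
        output_id integer NOT NULL,
        count     bigint  NOT NULL
    ) ON COMMIT DROP;

    -- Load the new batch (the path is a placeholder).
    COPY stats_5mn_incoming (t_value, t_record, output_id, count)
        FROM '/path/to/batch.csv' WITH (FORMAT csv);

    -- Update the rows whose triplet already exists...
    UPDATE stats_5mn s
       SET count = i.count
      FROM stats_5mn_incoming i
     WHERE s.t_value   = i.t_value
       AND s.t_record  = i.t_record
       AND s.output_id = i.output_id;

    -- ...and insert the genuinely new ones.
    INSERT INTO stats_5mn (t_value, t_record, output_id, count)
    SELECT i.t_value, i.t_record, i.output_id, i.count
      FROM stats_5mn_incoming i
     WHERE NOT EXISTS (
           SELECT 1 FROM stats_5mn s
            WHERE s.t_value   = i.t_value
              AND s.t_record  = i.t_record
              AND s.output_id = i.output_id);

    COMMIT;

With your index on (t_value, t_record, output_id) in place, both the
UPDATE join and the NOT EXISTS probe should be cheap for a batch of a
few thousand rows.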
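If you stay with the COPY-then-dedup approach, your min(id) query
presumably feeds a DELETE along these lines (a sketch; it assumes each
batch adds at most one duplicate per triplet, so removing min(id)
keeps the newest row with the new count):

    DELETE FROM stats_5mn
     WHERE id IN (SELECT min(id)
                    FROM stats_5mn
                   GROUP BY t_value, t_record, output_id
                  HAVING count(*) > 1);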
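The BUFFERS variant mentioned above (available since 9.0) looks like
this; a large "shared hit" count relative to "read" in its output
would confirm that the data is coming from cache rather than disk:

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT min(id) FROM stats_5mn
    GROUP BY t_value, t_record, output_id
    HAVING count(*) > 1;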
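And if you do decide the data will always be cached, the setting tweak
would look something like this (the value is illustrative, not a
recommendation, for the reasons given above):

    -- Default random_page_cost is 4.0; lowering it toward seq_page_cost
    -- (default 1.0) tells the planner that random reads are nearly as
    -- cheap as sequential ones, which is true only for cached data.
    SET random_page_cost = 2.0;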