On Fri, Jan 6, 2012 at 6:35 AM,  <antoine@xxxxxxxxx> wrote:
> Hello,
>
> I have a table with approximately 50 million rows and a schema like this:
>
>    id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
>    t_value integer NOT NULL DEFAULT 0,
>    t_record integer NOT NULL DEFAULT 0,
>    output_id integer NOT NULL DEFAULT 0,
>    count bigint NOT NULL DEFAULT 0,
>    CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)
>
> Every 5 minutes, a process has to insert a few thousand rows into this
> table, but sometimes it has to insert an already existing row (based on
> the values of the triplet (t_value, t_record, output_id)). In that case,
> the row must be updated with the new count value. I've tried some of the
> solutions given in this stackoverflow question [1], but the insertion
> rate is always too low for my needs.

What are your needs?  It should take no special hardware or coding to
be able to manage a few thousand rows over 5 minutes (one way to do it
is sketched at the end of this mail).

> So I've decided to do it in two steps:
>
> - I insert all my new data with a COPY command
> - When that's done, I run a delete query to remove the oldest duplicates
>
> Right now, the heart of my delete query looks like this:
>
>    SELECT min(id) FROM stats_5mn
>    GROUP BY t_value, t_record, output_id
>    HAVING count(*) > 1;
>
> The duration of the query on my test machine with approx. 16 million
> rows is ~18s.
>
> To reduce this duration, I've tried to add an index on my triplet:
>
>    CREATE INDEX test
>    ON stats_5mn
>    USING btree
>    (t_value, t_record, output_id);
>
> By default, the PostgreSQL planner doesn't want to use my index and does
> a sequential scan [2], but if I force it with "SET enable_seqscan = off",
> the index is used [3] and the query duration drops to ~5s.
>
> My questions:
>
> - Why does the planner refuse to use my index?

It thinks that using the index will be about 9 times more expensive
than the full scan.  Probably your settings for seq_page_cost and
random_page_cost are such that the planner thinks that nearly every
buffer read is going to be from disk.  But in reality (in this case)
your data is all in memory, so the planner is mis-estimating.  (It
would help to verify this if you did your EXPLAIN ANALYZE with BUFFERS
as well; see the example at the end of this mail.)

But before trying to fix this by tweaking settings, will the real case
always be like your test case?  If the data stops being all in memory,
either because the problem size increases or because you have to
compete for buffer space with other things going on, then using the
index scan could be catastrophic.

Cheers,

Jeff
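P.S. A few sketches that might help, each under stated assumptions.

Since there is no built-in upsert available here, one common pattern
is to COPY into a staging table and then update/insert from it in a
single transaction, instead of COPY-then-dedup.  This is only a
sketch: the staging table name and the file path are made up, and it
assumes a single loader process with no concurrent writers:

    BEGIN;

    -- Hypothetical staging table; holds one 5-minute batch.
    CREATE TEMP TABLE stats_5mn_incoming (
        t_value   integer NOT NULL,
        t_record  integer NOT NULL,
        output_id integer NOT NULL,
        count     bigint  NOT NULL
    ) ON COMMIT DROP;

    -- Load the new batch (the path is a placeholder).
    COPY stats_5mn_incoming (t_value, t_record, output_id, count)
        FROM '/path/to/batch.csv' WITH (FORMAT csv);

    -- Update the rows whose triplet already exists...
    UPDATE stats_5mn s
       SET count = i.count
      FROM stats_5mn_incoming i
     WHERE s.t_value   = i.t_value
       AND s.t_record  = i.t_record
       AND s.output_id = i.output_id;

    -- ...and insert the genuinely new ones.
    INSERT INTO stats_5mn (t_value, t_record, output_id, count)
    SELECT i.t_value, i.t_record, i.output_id, i.count
      FROM stats_5mn_incoming i
     WHERE NOT EXISTS (
           SELECT 1 FROM stats_5mn s
            WHERE s.t_value   = i.t_value
              AND s.t_record  = i.t_record
              AND s.output_id = i.output_id);

    COMMIT;

With your index on (t_value, t_record, output_id) in place, both the
UPDATE join and the NOT EXISTS probe should be cheap for a batch of a
few thousand rows.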
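If you stay with the COPY-then-dedup approach, your min(id) query
presumably feeds a DELETE along these lines (a sketch; it assumes each
batch adds at most one duplicate per triplet, so removing min(id)
keeps the newest row with the new count):

    DELETE FROM stats_5mn
     WHERE id IN (SELECT min(id)
                    FROM stats_5mn
                   GROUP BY t_value, t_record, output_id
                  HAVING count(*) > 1);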
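The BUFFERS variant mentioned above (available since 9.0) looks like
this; a large "shared hit" count relative to "read" in its output
would confirm that the data is coming from cache rather than disk:

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT min(id) FROM stats_5mn
    GROUP BY t_value, t_record, output_id
    HAVING count(*) > 1;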
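And if you do decide the data will always be cached, the setting tweak
would look something like this (the value is illustrative, not a
recommendation, for the reasons given above):

    -- Default random_page_cost is 4.0; lowering it toward seq_page_cost
    -- (default 1.0) tells the planner that random reads are nearly as
    -- cheap as sequential ones, which is true only for cached data.
    SET random_page_cost = 2.0;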