Re: Duplicate deletion optimizations

On Fri, 06 Jan 2012 15:35:36 +0100, antoine@xxxxxxxxx wrote:
Hello,

I have a table with approximately 50 million rows, with a schema like this:

    id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
    t_value integer NOT NULL DEFAULT 0,
    t_record integer NOT NULL DEFAULT 0,
    output_id integer NOT NULL DEFAULT 0,
    count bigint NOT NULL DEFAULT 0,
    CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)

Every 5 minutes, a process has to insert a few thousand rows into this
table, but sometimes it has to insert an already existing row (based on
the values of the triplet (t_value, t_record, output_id)). In that case,
the row must be updated with the new count value. I've tried some of the
solutions given in these stackoverflow questions [1], but the insertion
rate is always too low for my needs.
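
For reference, the pattern those questions suggest is roughly the sketch
below: a plpgsql function that tries an UPDATE first and falls back to an
INSERT. It is only a sketch; it assumes a unique index on the triplet
(t_value, t_record, output_id), which the schema above does not define,
and the function name is made up:

    CREATE OR REPLACE FUNCTION merge_stats_5mn(p_t_value integer,
                                               p_t_record integer,
                                               p_output_id integer,
                                               p_count bigint)
    RETURNS void AS $$
    BEGIN
        LOOP
            -- First try to update an existing row for this triplet.
            UPDATE stats_5mn
               SET count = p_count
             WHERE t_value = p_t_value
               AND t_record = p_t_record
               AND output_id = p_output_id;
            IF FOUND THEN
                RETURN;
            END IF;
            -- No row yet: try to insert one. A concurrent insert of the
            -- same triplet raises unique_violation (this requires a
            -- unique index on the triplet), in which case we loop back
            -- and retry the UPDATE.
            BEGIN
                INSERT INTO stats_5mn (t_value, t_record, output_id, count)
                VALUES (p_t_value, p_t_record, p_output_id, p_count);
                RETURN;
            EXCEPTION WHEN unique_violation THEN
                -- Do nothing; loop and retry the UPDATE.
            END;
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;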

So, I've decided to do it in two steps:

 - I insert all my new data with a COPY command (sketched below)
 - When that's done, I run a delete query to remove the oldest duplicates
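
The COPY step is roughly the following (the file name and CSV format are
only an example; the real data comes from the 5-minute process, and \copy
from psql can be used instead if the file lives on the client side):

    COPY stats_5mn (t_value, t_record, output_id, count)
        FROM '/tmp/new_stats.csv' WITH CSV;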

Right now, my delete query looks like this:

    SELECT min(id) FROM stats_5mn
    GROUP BY t_value, t_record, output_id
    HAVING count(*) > 1;

Correction:

    DELETE FROM stats_5mn WHERE id in (
        SELECT min(id) FROM stats_5mn
        GROUP BY t_value, t_record, output_id
        HAVING count(*) > 1
    );

Sorry :-)


The duration of the query on my test machine with approx. 16 million
rows is ~18s.
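
For comparison, the same cleanup can also be written as a self-join delete
that keeps only the newest row of each triplet (a sketch only, not timed
on this data set):

    -- Delete every row for which a newer row (higher id) with the
    -- same (t_value, t_record, output_id) triplet exists.
    DELETE FROM stats_5mn s
    USING stats_5mn n
    WHERE s.t_value   = n.t_value
      AND s.t_record  = n.t_record
      AND s.output_id = n.output_id
      AND s.id        < n.id;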

To reduce this duration, I tried adding an index on the triplet:

    CREATE INDEX test
      ON stats_5mn
      USING btree
      (t_value, t_record, output_id);

By default, the PostgreSQL planner doesn't want to use my index and does
a sequential scan [2], but if I force it with "SET enable_seqscan = off",
the index is used [3] and the query duration drops to ~5s.
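
For reference, the index-scan plan in [3] was produced with a session-level
override along these lines (re-running the grouping query under
EXPLAIN ANALYZE):

    SET enable_seqscan = off;
    EXPLAIN ANALYZE
    SELECT min(id) FROM stats_5mn
    GROUP BY t_value, t_record, output_id
    HAVING count(*) > 1;
    RESET enable_seqscan;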


My questions:

 - Why does the planner refuse to use my index?
 - Is there a better method for my problem?


Thanks in advance for your help,
Antoine Millet.


[1] http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql
    http://stackoverflow.com/questions/3464750/postgres-upsert-insert-or-update-only-if-value-is-different

[2] http://explain.depesz.com/s/UzW :
    GroupAggregate  (cost=1167282.380..1294947.770 rows=762182 width=20) (actual time=20067.661..20067.661 rows=0 loops=1)
      Filter: (five(*) > 1)
      ->  Sort  (cost=1167282.380..1186336.910 rows=7621814 width=20) (actual time=15663.549..17463.458 rows=7621805 loops=1)
            Sort Key: delta, kilo, four
            Sort Method:  external merge  Disk: 223512kB
            ->  Seq Scan on three  (cost=0.000..139734.140 rows=7621814 width=20) (actual time=0.041..2093.434 rows=7621805 loops=1)

[3] http://explain.depesz.com/s/o9P :
    GroupAggregate  (cost=0.000..11531349.190 rows=762182 width=20) (actual time=5307.734..5307.734 rows=0 loops=1)
      Filter: (five(*) > 1)
      ->  Index Scan using charlie on three  (cost=0.000..11422738.330 rows=7621814 width=20) (actual time=0.046..2062.952 rows=7621805 loops=1)



