Re: Duplicate deletion optimizations

"Strange, John W" <john.w.strange@xxxxxxxxxxxx> · Fri, 6 Jan 2012 19:02:01 -0500

Are your stats updated on the table after you added the index?

- run the bad query with explain verbose on (you should send this anyways)
- check to see what the difference is in expected rows vs. actual rows
- make sure that your work_mem is high enough if you are sorting, if not you'll see it write out a temp file which will be slow.
- if there is different analyze the table and rerun the query to see if you get the expected results.
- I do believe having COUNT(*) > 1 will never use an index, but someone more experience can comment here.

-----Original Message-----
From: pgsql-performance-owner@xxxxxxxxxxxxxx [mailto:pgsql-performance-owner@xxxxxxxxxxxxxx] On Behalf Of antoine@xxxxxxxxx
Sent: Friday, January 06, 2012 8:36 AM
To: pgsql-performance@xxxxxxxxxxxxxx
Subject:  Duplicate deletion optimizations

Hello,

I've a table with approximately 50 million rows with a schema like
this:

     id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
     t_value integer NOT NULL DEFAULT 0,
     t_record integer NOT NULL DEFAULT 0,
     output_id integer NOT NULL DEFAULT 0,
     count bigint NOT NULL DEFAULT 0,
     CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)

Every 5 minutes, a process have to insert a few thousand of rows in this table, but sometime, the process have to insert an already existing row (based on values in the triplet (t_value, t_record, output_id). In this case, the row must be updated with the new count value. I've tried some solution given on this stackoverflow question [1] but the insertion rate is always too low for my needs.

So, I've decided to do it in two times:

  - I insert all my new data with a COPY command
  - When it's done, I run a delete query to remove oldest duplicates

Right now, my delete query look like this:

     SELECT min(id) FROM stats_5mn
     GROUP BY t_value, t_record, output_id
     HAVING count(*) > 1;

The duration of the query on my test machine with approx. 16 million rows is ~18s.

To reduce this duration, I've tried to add an index on my triplet:

     CREATE INDEX test
       ON stats_5mn
       USING btree
       (t_value , t_record , output_id );

By default, the PostgreSQL planner doesn't want to use my index and do a sequential scan [2], but if I force it with "SET enable_seqscan = off", the index is used [3] and query duration is lowered to ~5s.

My questions:

  - Why the planner refuse to use my index?
  - Is there a better method for my problem?

Thanks by advance for your help,
Antoine Millet.

[1]
http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql

http://stackoverflow.com/questions/3464750/postgres-upsert-insert-or-update-only-if-value-is-different

[2] http://explain.depesz.com/s/UzW :
     GroupAggregate  (cost=1167282.380..1294947.770 rows=762182
width=20) (actual time=20067.661..20067.661 rows=0 loops=1)
         Filter: (five(*) > 1)
       ->  Sort  (cost=1167282.380..1186336.910 rows=7621814 width=20) (actual time=15663.549..17463.458 rows=7621805 loops=1)
               Sort Key: delta, kilo, four
               Sort Method:  external merge  Disk: 223512kB
             ->  Seq Scan on three  (cost=0.000..139734.140 rows=7621814
width=20) (actual time=0.041..2093.434 rows=7621805 loops=1)

[3] http://explain.depesz.com/s/o9P :
     GroupAggregate  (cost=0.000..11531349.190 rows=762182 width=20) (actual time=5307.734..5307.734 rows=0 loops=1)
         Filter: (five(*) > 1)
       ->  Index Scan using charlie on three  (cost=0.000..11422738.330
rows=7621814 width=20) (actual time=0.046..2062.952 rows=7621805
loops=1)

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

This email is confidential and subject to important disclaimers and
conditions including on offers for the purchase or sale of
securities, accuracy and completeness of information, viruses,
confidentiality, legal privilege, and legal entity disclaimers,
available at http://www.jpmorgan.com/pages/disclosures/email.  
-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance