Re: delete duplicates takes too long

Miguel Miranda <miguel.mirandag@xxxxxxxxx> · Fri, 24 Apr 2009 18:05:08 -0600

I can't use a unique index because I only want to check for duplicates
where processed = 2; for simplicity I did not include that condition in
the example.
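
For what it's worth, PostgreSQL supports partial unique indexes, which may cover the "only where processed = 2" case. The following is a minimal sketch, not something from the thread: the index name is made up, and it assumes the existing duplicates have already been removed (otherwise the statement fails).

```sql
-- Sketch: enforce uniqueness only among rows with processed = 2.
-- The index name cdr_ama_stat_processed2_uniq is hypothetical.
-- Existing duplicates must be deleted first or index creation will fail.
CREATE UNIQUE INDEX cdr_ama_stat_processed2_uniq
    ON cdr_ama_stat (abonado_a, abonado_b, fecha_llamada, duracion)
    WHERE processed = 2;
```

Note that a unique index treats NULLs as distinct, and all four columns are nullable here, so rows with a NULL in any of them would not be constrained.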

On Fri, Apr 24, 2009 at 5:50 PM, Scott Marlowe <scott.marlowe@xxxxxxxxx> wrote:

On Fri, Apr 24, 2009 at 5:37 PM, Miguel Miranda <miguel.mirandag@xxxxxxxxx> wrote:

> Hi, I have a table:
>
> CREATE TABLE public.cdr_ama_stat (
>     id int4 NOT NULL DEFAULT nextval('cdr_ama_stat_id_seq'::regclass),
>     abonado_a varchar(30) NULL,
>     abonado_b varchar(30) NULL,
>     fecha_llamada timestamp NULL,
>     duracion int4 NULL,
>     puerto_a varchar(4) NULL,
>     puerto_b varchar(4) NULL,
>     tipo_llamada char(1) NULL,
>     processed int4 NULL,
>     PRIMARY KEY(id)
> )
> GO
> CREATE INDEX kpi_fecha_llamada
>     ON public.cdr_ama_stat(fecha_llamada)
>

> The combination of abonado_a, abonado_b, fecha_llamada, and duracion
> should be unique in every row. Googling around, I found how to delete
> duplicates on the postgresonline site,

Then why not have a unique index on those columns together?

> so I ran the following query (let's say I want to know how many duplicates
> exist for 2009-04-18 before deleting them):
>

> SELECT * FROM cdr_ama_stat
> WHERE id NOT IN
>     (SELECT MAX(dt.id)
>      FROM cdr_ama_stat AS dt
>      WHERE dt.fecha_llamada BETWEEN '2009-04-18' AND '2009-04-18'::timestamp +
>      INTERVAL '1 day'
>      GROUP BY dt.abonado_a, dt.abonado_b, dt.fecha_llamada, dt.duracion)
> AND fecha_llamada BETWEEN '2009-04-18' AND '2009-04-18'::timestamp +
> INTERVAL '1 day'
>

> My problem is that the query takes forever. Number of rows:

Have you tried throwing more work_mem at the problem?
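
A rough sketch of what trying that could look like, for this session only. The 256MB figure is an arbitrary example value, not a recommendation. As far as I know, the planner only uses a hashed subplan for NOT IN when it estimates the subquery result will fit in work_mem; otherwise the subquery result is rescanned for every outer row, which would explain "forever". Checking the plan with EXPLAIN before and after should show whether the setting made a difference.

```sql
-- Raise work_mem for the current session only (example value, adjust to taste).
SET work_mem = '256MB';

-- Inspect the plan; a "hashed" subplan in the filter is the thing to look for.
EXPLAIN
SELECT * FROM cdr_ama_stat
WHERE id NOT IN
      (SELECT MAX(dt.id)
       FROM cdr_ama_stat AS dt
       WHERE dt.fecha_llamada BETWEEN '2009-04-18'
                                  AND '2009-04-18'::timestamp + INTERVAL '1 day'
       GROUP BY dt.abonado_a, dt.abonado_b, dt.fecha_llamada, dt.duracion)
  AND fecha_llamada BETWEEN '2009-04-18'
                        AND '2009-04-18'::timestamp + INTERVAL '1 day';

-- Return to the configured default afterwards.
RESET work_mem;
```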

The other way to do this uses a self-join instead of GROUP BY. Depending
on the number of dupes, it can be faster or slower.

delete from mytable x where x.id in
    (select a.id from mytable a join mytable b on (a.somefield=b.somefield
     and a.id < b.id))

Or something like that.
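
Applied to the table from the original post, the self-join approach might look roughly like the sketch below. It assumes "duplicate" means equal abonado_a, abonado_b, fecha_llamada, and duracion, and it keeps the row with the highest id in each group, as the MAX(id) query does. The processed = 2 filter is shown commented out since it was not part of the original example.

```sql
-- Preview: count the rows that have a duplicate with a higher id.
SELECT count(DISTINCT a.id)
FROM cdr_ama_stat a
JOIN cdr_ama_stat b
  ON  a.abonado_a     = b.abonado_a
  AND a.abonado_b     = b.abonado_b
  AND a.fecha_llamada = b.fecha_llamada
  AND a.duracion      = b.duracion
  AND a.id < b.id
WHERE a.fecha_llamada BETWEEN '2009-04-18'
                          AND '2009-04-18'::timestamp + INTERVAL '1 day';
  -- AND a.processed = 2 AND b.processed = 2   -- optional, per the follow-up

-- Delete every row for which a duplicate with a higher id exists.
DELETE FROM cdr_ama_stat x
WHERE x.id IN
      (SELECT a.id
       FROM cdr_ama_stat a
       JOIN cdr_ama_stat b
         ON  a.abonado_a     = b.abonado_a
         AND a.abonado_b     = b.abonado_b
         AND a.fecha_llamada = b.fecha_llamada
         AND a.duracion      = b.duracion
         AND a.id < b.id
       WHERE a.fecha_llamada BETWEEN '2009-04-18'
                                 AND '2009-04-18'::timestamp + INTERVAL '1 day');
```

One difference from the GROUP BY version: the join compares with =, so rows with a NULL in any of the four columns never match each other, whereas GROUP BY puts NULLs in the same group. Running the preview inside a transaction before the DELETE is probably a good idea.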