Received: by 10.49.34.135 with SMTP id z7mr3027069qei.1.1356317111181; Sun, 23 Dec 2012 18:45:11 -0800 (PST)
X-Newsgroups: pgsql.general
Date: Sun, 23 Dec 2012 18:45:11 -0800 (PST)
In-Reply-To: <aa6b08350904241637n4aabb598q88b00b4d51895010@xxxxxxxxxxxxxx>
References: <aa6b08350904241637n4aabb598q88b00b4d51895010@xxxxxxxxxxxxxx>
Message-ID: <53e63980-d1bd-498e-a8d2-caeaf4e00de2@xxxxxxxxxxxxxxxx>
Subject: Re: delete duplicates takes too long
From: dhakaboy13@xxxxxxxxx
To: pgsql-general@xxxxxxxxxxxxxx

> > CREATE TABLE public.cdr_ama_stat (
> >   id int4 NOT NULL DEFAULT nextval('cdr_ama_stat_id_seq'::regclass),
> >   abonado_a varchar(30) NULL,
> >   abonado_b varchar(30) NULL,
> >   fecha_llamada timestamp NULL,
> >   duracion int4 NULL,
> >   puerto_a varchar(4) NULL,
> >   puerto_b varchar(4) NULL,
> >   tipo_llamada char(1) NULL,
> >   processed int4 NULL,
> >   PRIMARY KEY(id)
> > );
> >
> > CREATE INDEX kpi_fecha_llamada
> >   ON public.cdr_ama_stat(fecha_llamada);
> >
> > There should be unique values for abonado_a, abonado_b, fecha_llamada,
> > and duracion in every row. Googling around, I found how to delete
> > duplicates on the postgresonline site, so I ran the following query
> > (let's say I want to know how many duplicates exist for 2009-04-18
> > before deleting them):
> >
> > SELECT * FROM cdr_ama_stat
> > WHERE id NOT IN
> >   (SELECT MAX(dt.id)
> >    FROM cdr_ama_stat AS dt
> >    WHERE dt.fecha_llamada BETWEEN '2009-04-18' AND '2009-04-18'::timestamp + INTERVAL '1 day'
> >    GROUP BY dt.abonado_a, dt.abonado_b, dt.fecha_llamada, dt.duracion)
> > AND fecha_llamada BETWEEN '2009-04-18' AND '2009-04-18'::timestamp + INTERVAL '1 day';
> >
> > My problem is that the query takes forever. Number of rows:
> >
> > kpi=# select count(*) from cdr_ama_stat;
> >   count
> > ---------
> >  5908065
> > (1 row)
> >
> > This is the EXPLAIN output for the query above:
> >
> >                               QUERY PLAN
> > -----------------------------------------------------------------------------------------------------------
> >  Index Scan using kpi_fecha_llamada on cdr_ama_stat  (cost=115713.94..79528582.40 rows=140809 width=50)
> >    Index Cond: ((fecha_llamada >= '2009-04-18 00:00:00'::timestamp without time zone) AND (fecha_llamada <= '2009-04-19 00:00:00'::timestamp without time zone))
> >    Filter: (NOT (subplan))
> >    SubPlan
> >      ->  Materialize  (cost=115713.94..116202.56 rows=28162 width=34)
> >            ->  GroupAggregate  (cost=110902.49..115478.78 rows=28162 width=34)
> >                  ->  Sort  (cost=110902.49..111606.53 rows=281618 width=34)
> >                        Sort Key: dt.abonado_a, dt.abonado_b, dt.fecha_llamada, dt.duracion
> >                        ->  Bitmap Heap Scan on cdr_ama_stat dt  (cost=8580.03..70970.30 rows=281618 width=34)
> >                              Recheck Cond: ((fecha_llamada >= '2009-04-18 00:00:00'::timestamp without time zone) AND (fecha_llamada <= '2009-04-19 00:00:00'::timestamp without time zone))
> >                              ->  Bitmap Index Scan on kpi_fecha_llamada  (cost=0.00..8509.62 rows=281618 width=0)
> >                                    Index Cond: ((fecha_llamada >= '2009-04-18 00:00:00'::timestamp without time zone) AND (fecha_llamada <= '2009-04-19 00:00:00'::timestamp without time zone))
> >
> > Am I doing something wrong? I think several minutes should be OK, but
> > not several hours as happens now. I do a bulk load (approx. 350k rows)
> > into that table every day, and yes, I did VACUUM FULL ANALYZE the
> > database before posting these results.
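The plan above shows why the query is slow: the `NOT IN (subplan)` is evaluated as a `Filter: (NOT (subplan))` against a materialized subquery for every candidate row, which degenerates toward O(rows × duplicates). A common rewrite (a sketch only, not tested against this schema) is a self-join `DELETE ... USING` that keeps the highest `id` per group and lets the planner use a real join; `IS NOT DISTINCT FROM` is used because the four columns are all declared NULLable:

```sql
-- Delete every row that has a "twin" with a higher id and the same
-- (abonado_a, abonado_b, fecha_llamada, duracion).  The self-join can be
-- planned as a hash/merge join instead of a per-row NOT IN subplan.
DELETE FROM cdr_ama_stat a
USING cdr_ama_stat b
WHERE a.id < b.id
  AND a.abonado_a     IS NOT DISTINCT FROM b.abonado_a
  AND a.abonado_b     IS NOT DISTINCT FROM b.abonado_b
  AND a.fecha_llamada IS NOT DISTINCT FROM b.fecha_llamada
  AND a.duracion      IS NOT DISTINCT FROM b.duracion
  AND a.fecha_llamada BETWEEN '2009-04-18'
      AND '2009-04-18'::timestamp + INTERVAL '1 day'
  AND b.fecha_llamada BETWEEN '2009-04-18'
      AND '2009-04-18'::timestamp + INTERVAL '1 day';

-- Once the table is clean, a unique index (hypothetical name) would stop
-- the daily bulk load from re-introducing duplicates.  Note that a unique
-- index treats NULLs as distinct, so rows containing NULLs can still repeat.
CREATE UNIQUE INDEX cdr_ama_stat_uniq
  ON public.cdr_ama_stat (abonado_a, abonado_b, fecha_llamada, duracion);
```

Rewriting the subquery with `NOT EXISTS` instead of `NOT IN` is another option; on older PostgreSQL releases the planner handles `NOT EXISTS` (an anti-join) far better than `NOT IN`, which must also account for NULL semantics.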
> > Server is Dual Xeon 3.0 GHz, 2 GB RAM
> >
> > best regards

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general