Received: by 10.49.34.135 with SMTP id z7mr3027069qei.1.1356317111181; Sun, 23 Dec 2012 18:45:11 -0800 (PST)
X-Newsgroups: pgsql.general
Date: Sun, 23 Dec 2012 18:45:11 -0800 (PST)
In-Reply-To: <aa6b08350904241637n4aabb598q88b00b4d51895010@xxxxxxxxxxxxxx>
References: <aa6b08350904241637n4aabb598q88b00b4d51895010@xxxxxxxxxxxxxx>
Message-ID: <53e63980-d1bd-498e-a8d2-caeaf4e00de2@xxxxxxxxxxxxxxxx>
Subject: Re: delete duplicates takes too long
From: dhakaboy13@xxxxxxxxx
To: pgsql-general@xxxxxxxxxxxxxx

> > CREATE TABLE public.cdr_ama_stat (
> >   id int4 NOT NULL DEFAULT nextval('cdr_ama_stat_id_seq'::regclass),
> >   abonado_a varchar(30) NULL,
> >   abonado_b varchar(30) NULL,
> >   fecha_llamada timestamp NULL,
> >   duracion int4 NULL,
> >   puerto_a varchar(4) NULL,
> >   puerto_b varchar(4) NULL,
> >   tipo_llamada char(1) NULL,
> >   processed int4 NULL,
> >   PRIMARY KEY(id)
> > );
> >
> > CREATE INDEX kpi_fecha_llamada
> >   ON public.cdr_ama_stat(fecha_llamada);
> >
> > There should be unique values for abonado_a, abonado_b, fecha_llamada,
> > and duracion in every row. Googling around, I found how to delete
> > duplicates on the postgresonline site, so I ran the following query
> > (let's say I want to know how many duplicates exist for 2009-04-18
> > before deleting them):
> >
> > SELECT * FROM cdr_ama_stat
> > WHERE id NOT IN
> >   (SELECT MAX(dt.id)
> >    FROM cdr_ama_stat AS dt
> >    WHERE dt.fecha_llamada BETWEEN '2009-04-18' AND '2009-04-18'::timestamp + INTERVAL '1 day'
> >    GROUP BY dt.abonado_a, dt.abonado_b, dt.fecha_llamada, dt.duracion)
> > AND fecha_llamada BETWEEN '2009-04-18' AND '2009-04-18'::timestamp + INTERVAL '1 day';
> >
> > My problem is that the query takes forever. Number of rows:
> >
> > kpi=# select count(*) from cdr_ama_stat;
> >   count
> > ---------
> >  5908065
> > (1 row)
> >
> > This is the EXPLAIN output for the query above:
> >
> >                               QUERY PLAN
> > -----------------------------------------------------------------------------------------------------------
> >  Index Scan using kpi_fecha_llamada on cdr_ama_stat  (cost=115713.94..79528582.40 rows=140809 width=50)
> >    Index Cond: ((fecha_llamada >= '2009-04-18 00:00:00'::timestamp without time zone) AND (fecha_llamada <= '2009-04-19 00:00:00'::timestamp without time zone))
> >    Filter: (NOT (subplan))
> >    SubPlan
> >      ->  Materialize  (cost=115713.94..116202.56 rows=28162 width=34)
> >            ->  GroupAggregate  (cost=110902.49..115478.78 rows=28162 width=34)
> >                  ->  Sort  (cost=110902.49..111606.53 rows=281618 width=34)
> >                        Sort Key: dt.abonado_a, dt.abonado_b, dt.fecha_llamada, dt.duracion
> >                        ->  Bitmap Heap Scan on cdr_ama_stat dt  (cost=8580.03..70970.30 rows=281618 width=34)
> >                              Recheck Cond: ((fecha_llamada >= '2009-04-18 00:00:00'::timestamp without time zone) AND (fecha_llamada <= '2009-04-19 00:00:00'::timestamp without time zone))
> >                              ->  Bitmap Index Scan on kpi_fecha_llamada  (cost=0.00..8509.62 rows=281618 width=0)
> >                                    Index Cond: ((fecha_llamada >= '2009-04-18 00:00:00'::timestamp without time zone) AND (fecha_llamada <= '2009-04-19 00:00:00'::timestamp without time zone))
> >
> > Am I doing something wrong? I think several minutes should be OK, but
> > not several hours as happens now. I do a bulk load (approx. 350k rows)
> > into that table every day, and yes, I did VACUUM FULL ANALYZE the
> > database before posting these results.
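The plan above shows why the query is slow: the `NOT IN (subplan)` is evaluated as a `Filter: (NOT (subplan))` against a materialized subquery for every candidate row, which degenerates toward O(rows × duplicates). A common rewrite (a sketch only, not tested against this schema) is a self-join `DELETE ... USING` that keeps the highest `id` per group and lets the planner use a real join; `IS NOT DISTINCT FROM` is used because the four columns are all declared NULLable:

```sql
-- Delete every row that has a "twin" with a higher id and the same
-- (abonado_a, abonado_b, fecha_llamada, duracion).  The self-join can be
-- planned as a hash/merge join instead of a per-row NOT IN subplan.
DELETE FROM cdr_ama_stat a
USING cdr_ama_stat b
WHERE a.id < b.id
  AND a.abonado_a     IS NOT DISTINCT FROM b.abonado_a
  AND a.abonado_b     IS NOT DISTINCT FROM b.abonado_b
  AND a.fecha_llamada IS NOT DISTINCT FROM b.fecha_llamada
  AND a.duracion      IS NOT DISTINCT FROM b.duracion
  AND a.fecha_llamada BETWEEN '2009-04-18'
      AND '2009-04-18'::timestamp + INTERVAL '1 day'
  AND b.fecha_llamada BETWEEN '2009-04-18'
      AND '2009-04-18'::timestamp + INTERVAL '1 day';

-- Once the table is clean, a unique index (hypothetical name) would stop
-- the daily bulk load from re-introducing duplicates.  Note that a unique
-- index treats NULLs as distinct, so rows containing NULLs can still repeat.
CREATE UNIQUE INDEX cdr_ama_stat_uniq
  ON public.cdr_ama_stat (abonado_a, abonado_b, fecha_llamada, duracion);
```

Rewriting the subquery with `NOT EXISTS` instead of `NOT IN` is another option; on older PostgreSQL releases the planner handles `NOT EXISTS` (an anti-join) far better than `NOT IN`, which must also account for NULL semantics.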
> > Server is Dual Xeon 3.0 GHz, 2 GB RAM
> >
> > best regards

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general