Thanks Chetan, here is the output of your updated query :
explain select distinct(p.crawled_page_id) from page_content p
where
NOT EXISTS (select 1 from clause2 c where c.source_id =
p.crawled_page_id);
QUERY
PLAN
---------------------------------------------------------------------------------------
HashAggregate (cost=1516749.47..1520576.06 rows=382659 width=8)
-> Hash Anti Join (cost=1294152.41..1515791.80 rows=383071
width=8)
Hash Cond: (p.crawled_page_id = c.source_id)
-> Seq Scan on page_content p (cost=0.00..87132.17
rows=428817 width=8)
-> Hash (cost=771182.96..771182.96 rows=31876196 width=4)
-> Seq Scan on clause2 c (cost=0.00..771182.96
rows=31876196 width=4)
(6 rows)
And my explain analyze output is :
QUERY
PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=1516749.47..1520576.06 rows=382659 width=8)
(actual time=56666.181..56669.270 rows=72 loops=1)
-> Hash Anti Join (cost=1294152.41..1515791.80 rows=383071
width=8) (actual time=45740.789..56665.816 rows=74 loops=1)
Hash Cond: (p.crawled_page_id = c.source_id)
-> Seq Scan on page_content p (cost=0.00..87132.17
rows=428817 width=8) (actual time=0.012..715.915 rows=428467 loops=1)
-> Hash (cost=771182.96..771182.96 rows=31876196 width=4)
(actual time=45310.524..45310.524 rows=31853083 loops=1)
-> Seq Scan on clause2 c (cost=0.00..771182.96
rows=31876196 width=4) (actual time=0.055..23408.884 rows=31853083
loops=1)
Total runtime: 56687.660 ms
(7 rows)
But Is there is any option to tune it further and one more thing output
rows varies from 6 to 7.
Thanks & best Regards,
Adarsh Sharma
Chetan Suttraway wrote:
On Wed, Mar 23, 2011 at 11:58 AM, Adarsh
Sharma <adarsh.sharma@xxxxxxxxxx>
wrote:
Dear all,
I have 2 tables in my database name clause2( 4900 MB) &
page_content(1582 MB).
My table definations are as :
page_content :-
CREATE TABLE page_content
(
content_id integer,
wkb_geometry geometry,
link_level integer,
isprocessable integer,
isvalid integer,
isanalyzed integer,
islocked integer,
content_language character(10),
url_id integer,
publishing_date character(40),
heading character(150),
category character(150),
crawled_page_url character(500),
keywords character(500),
dt_stamp timestamp with time zone,
"content" character varying,
crawled_page_id bigint,
id integer
)
WITH (
OIDS=FALSE
);
Indexes on it :-
CREATE INDEX idx_page_id ON page_content USING btree
(crawled_page_id);
CREATE INDEX idx_page_id_content ON page_content USING btree
(crawled_page_id, content_language, publishing_date, isprocessable);
CREATE INDEX pgweb_idx ON page_content USING gin
(to_tsvector('english'::regconfig, content::text));
clause 2:-
CREATE TABLE clause2
(
id bigint NOT NULL DEFAULT nextval('clause_id_seq'::regclass),
source_id integer,
sentence_id integer,
clause_id integer,
tense character varying(30),
clause text,
CONSTRAINT pk_clause_demo_id PRIMARY KEY (id)
)WITH ( OIDS=FALSE);
Indexes on it :
CREATE INDEX idx_clause2_march10
ON clause2
USING btree
(id, source_id);
I perform a join query on it as :
explain analyze select distinct(p.crawled_page_id) from
page_content p , clause2 c where p.crawled_page_id != c.source_id ;
What it takes more than 1 hour to complete. As I issue the
explain
analyze command and cannot able to wait for output but I send my
explain output as :
QUERY
PLAN
--------------------------------------------------------------------------------------------------------
Unique (cost=927576.16..395122387390.13 rows=382659 width=8)
-> Nested Loop (cost=927576.16..360949839832.15
rows=13669019023195 width=8)
Join Filter: (p.crawled_page_id <> c.source_id)
-> Index Scan using idx_page_id on page_content p
(cost=0.00..174214.02 rows=428817 width=8)
-> Materialize (cost=927576.16..1370855.12 rows=31876196
width=4)
-> Seq Scan on clause2 c (cost=0.00..771182.96
rows=31876196 width=4)
(6 rows)
Please guide me how to make the above query run faster as I am not able
to do that.
Thanks, Adarsh
Could you try just explaining the below query:
explain select distinct(p.crawled_page_id) from page_content p where
NOT EXISTS (select 1 from clause2 c where c.source_id =
p.crawled_page_id);
The idea here is to avoid directly using NOT operator.
Regards,
Chetan
--
Chetan Suttraway
EnterpriseDB, The Enterprise
PostgreSQL company.
|