Re: Reason of Slowness of query

Adarsh Sharma <adarsh.sharma@xxxxxxxxxx> · Wed, 23 Mar 2011 13:00:26 +0530

Thanks Chetan, here is the output of your updated query :

explain  select distinct(p.crawled_page_id) from page_content p
where
NOT EXISTS (select 1 from  clause2 c where c.source_id =
p.crawled_page_id);

                                      QUERY
PLAN                                       

---------------------------------------------------------------------------------------

 HashAggregate  (cost=1516749.47..1520576.06 rows=382659 width=8)

   ->  Hash Anti Join  (cost=1294152.41..1515791.80 rows=383071
width=8)

         Hash Cond: (p.crawled_page_id = c.source_id)

         ->  Seq Scan on page_content p  (cost=0.00..87132.17
rows=428817 width=8)

         ->  Hash  (cost=771182.96..771182.96 rows=31876196 width=4)

               ->  Seq Scan on clause2 c  (cost=0.00..771182.96
rows=31876196 width=4)

(6 rows)

And my explain analyze output is :

                                                      QUERY
PLAN                                                                 

--------------------------------------------------------------------------------------------------------------------------------------------

 HashAggregate  (cost=1516749.47..1520576.06 rows=382659 width=8)
(actual time=56666.181..56669.270 rows=72 loops=1)

   ->  Hash Anti Join  (cost=1294152.41..1515791.80 rows=383071
width=8) (actual time=45740.789..56665.816 rows=74 loops=1)

         Hash Cond: (p.crawled_page_id = c.source_id)

         ->  Seq Scan on page_content p  (cost=0.00..87132.17
rows=428817 width=8) (actual time=0.012..715.915 rows=428467 loops=1)

         ->  Hash  (cost=771182.96..771182.96 rows=31876196 width=4)
(actual time=45310.524..45310.524 rows=31853083 loops=1)

               ->  Seq Scan on clause2 c  (cost=0.00..771182.96
rows=31876196 width=4) (actual time=0.055..23408.884 rows=31853083
loops=1)

 Total runtime: 56687.660 ms

(7 rows)

But Is there is any option to tune it further and one more thing output
rows varies from 6 to 7.

Thanks & best Regards,

Adarsh Sharma

Chetan Suttraway wrote:

  On Wed, Mar 23, 2011 at 11:58 AM, Adarsh
Sharma <adarsh.sharma@xxxxxxxxxx>
wrote:

    Dear all,

I have 2 tables in my database name clause2( 4900 MB) &
page_content(1582 MB).

My table definations are as :

    page_content :-

    CREATE TABLE page_content

(

  content_id integer,

  wkb_geometry geometry,

  link_level integer,

  isprocessable integer,

  isvalid integer,

  isanalyzed integer,

  islocked integer,

  content_language character(10),

  url_id integer,

  publishing_date character(40),

  heading character(150),

  category character(150),

  crawled_page_url character(500),

  keywords character(500),

  dt_stamp timestamp with time zone,

  "content" character varying,

  crawled_page_id bigint,

  id integer

)

WITH (

  OIDS=FALSE

);

    Indexes on it :-

CREATE INDEX idx_page_id  ON page_content  USING btree 
(crawled_page_id);

CREATE INDEX idx_page_id_content   ON page_content  USING btree 
(crawled_page_id, content_language, publishing_date, isprocessable);

CREATE INDEX pgweb_idx  ON page_content   USING gin  
(to_tsvector('english'::regconfig, content::text));

    clause 2:-

    CREATE TABLE clause2

(

  id bigint NOT NULL DEFAULT nextval('clause_id_seq'::regclass),

  source_id integer,

  sentence_id integer,

  clause_id integer,

  tense character varying(30),

  clause text,

  CONSTRAINT pk_clause_demo_id PRIMARY KEY (id)

)WITH ( OIDS=FALSE);

    Indexes on it :

    CREATE INDEX idx_clause2_march10

  ON clause2

  USING btree

  (id, source_id);

    I perform a join query on it as :

     explain analyze select distinct(p.crawled_page_id) from
page_content p , clause2  c where p.crawled_page_id != c.source_id ;

    What it takes more than 1 hour to complete. As I issue the
explain
analyze command and cannot able to wait for output but I send my
explain output as :

                                             QUERY
PLAN                                               

--------------------------------------------------------------------------------------------------------

 Unique  (cost=927576.16..395122387390.13 rows=382659 width=8)

   ->  Nested Loop  (cost=927576.16..360949839832.15
rows=13669019023195 width=8)

         Join Filter: (p.crawled_page_id <> c.source_id)

         ->  Index Scan using idx_page_id on page_content p 
(cost=0.00..174214.02 rows=428817 width=8)

         ->  Materialize  (cost=927576.16..1370855.12 rows=31876196
width=4)

               ->  Seq Scan on clause2 c  (cost=0.00..771182.96
rows=31876196 width=4)

(6 rows)

Please guide me how to make the above query run faster as I am not able
to do that.

Thanks, Adarsh

Could you try just explaining the below query:

explain  select distinct(p.crawled_page_id) from page_content p where
NOT EXISTS (select 1 from  clause2 c where c.source_id =
p.crawled_page_id);

The idea here is to avoid directly using NOT operator.

Regards,

Chetan

-- 

Chetan Suttraway

  EnterpriseDB, The Enterprise
PostgreSQL company.