Convert IN sublink to join

<francisco.santos@xxxxxxxxxxxxxxxx> · Wed, 14 Dec 2005 13:27:21 -0000

Title: Dear Sir or Madam:

Dear Sir or Madam:

The function "convert_IN_to_join(Query *parse, SubLink *sublink)",
from file: <postgres-8.0.4 root>/src/backend/optimizer/plan/subselect.c,
is responsible for converting IN type sublinks to joins, whenever
appropriate.

The following lines of code, extracted from convert_IN_to_join, verify if the
subquery is correlated:

      /*

       * The sub-select must not refer to any Vars of the parent query.

       * (Vars of higher levels should be okay, though.)

       */

      if (contain_vars_of_level((Node *) subselect, 1))

              return NULL; 

By commenting this code region I was able to optimize several correlated
subqueries. Apparently, the rest of the PostgreSQL code is ready to handle the
"convert subquery to join" optimization. Is this test really necessary?

Please analyze the following example:

DDL:

CREATE TABLE "students" ( sid char(5) primary key, name char(20) not
null, age integer, email char(20) not null, unique( email ), avgrade float not
null );

CREATE TABLE "enrolled" ( sid char(5), cid char(5), grade real not
null, primary key(sid,cid), foreign key(sid) references students (sid) on
delete restrict  );

DML:

1) correlated IN subquery:

"Find all students who's grade in 'TFC' class is higher than their average
grade."

select a.sid, a.name from students a where a.sid IN ( select i.sid from
enrolled i where i.grade > a.avgrade AND i.cid = 'TFC');

QUERY PLAN:

 Seq Scan on students a  (cost=0.00..5763804.50 rows=5000 width=33)

   Filter: (subplan)

   SubPlan

     ->  Seq Scan on enrolled i  (cost=0.00..1144.00 rows=3473 width=9)

           Filter: ((grade > $0) AND (cid = 'TFC'::bpchar))

2) the same query after commenting out the above code region in
convert_IN_to_join:

QUERY PLAN:

 Hash Join  (cost=1050.24..1518.21 rows=693 width=33)

   Hash Cond: ("outer".sid = "inner".sid)

   Join Filter: ("inner".grade > "outer".avgrade)

   ->  Seq Scan on students a  (cost=0.00..367.00 rows=10000 width=41)

   ->  Hash  (cost=1045.04..1045.04 rows=2078 width=13)

         ->  HashAggregate  (cost=1045.04..1045.04 rows=2078 width=13)

               ->  Seq Scan on enrolled i  (cost=0.00..1019.00 rows=10417
width=13)

                     Filter: (cid = 'TFC'::bpchar)

3) Clearly, it is possible to extract the IN subquery from query 1 since the
outer attribute a.sid matches, at most once, with the inner tuple i.sid.
Although s.sid is not a primary key by itself, together with "i.cid =
'TFC'" conjunct, it forms a unique tuple. Here is an efficient alternative
to query 1:

select a.sid, a.name from students a, enrolled i where a.sid = i.sid AND  i.cid
= 'TFC' AND i.grade > a.avgrade;

QUERY PLAN:

 Hash Join  (cost=480.00..2366.86 rows=3473 width=33)

   Hash Cond: ("outer".sid = "inner".sid)

   Join Filter: ("outer".grade > "inner".avgrade)

   ->  Seq Scan on enrolled i  (cost=0.00..1019.00 rows=10417
width=13)

         Filter: (cid = 'TFC'::bpchar)

   ->  Hash  (cost=367.00..367.00 rows=10000 width=41)

         ->  Seq Scan on students a  (cost=0.00..367.00 rows=10000
width=41)

I have verified that both 2) and 3) return the exact same tuples, query 1)
never completed due to the highly inefficient execution plan.

Please help me with this issue.

Kind regards,

Francisco Santos