Re: Why hash join cost calculation need reduction

高健 <luckyjackgao@xxxxxxxxx> · Fri, 14 Jun 2013 13:16:39 +0800

Hello:

Thanks for replying!

I understand it a little more.

And I compared the following statements:

First:

postgres=# explain analyze select * from sales s inner join
customers c on s.cust_id = c.cust_id;

QUERY PLAN                                                   

------------------------------------------------------------------------------------------------------------------

 Hash Join  (cost=1.07..2.15 rows=3 width=84) (actual
time=0.017..0.019 rows=3 loops=1)

   Hash Cond:
(s.cust_id = c.cust_id)

   ->  Seq Scan on sales s  (cost=0.00..1.04 rows=4 width=42) (actual
time=0.004..0.004 rows=4 loops=1)

   ->  Hash 
(cost=1.03..1.03 rows=3 width=42) (actual time=0.004..0.004 rows=3
loops=1)

         Buckets:
1024  Batches: 1  Memory Usage: 1kB

         ->  Seq Scan on customers c  (cost=0.00..1.03 rows=3 width=42) (actual
time=0.001..0.001 rows=3 loops=1)

 Total runtime: 0.046
ms

(7 rows)

Second:

postgres=# explain analyze select * from sales s inner join
customers c on s.cust_id = c.cust_id and c.cust_id =2;

QUERY PLAN                                                

------------------------------------------------------------------------------------------------------------

 Nested Loop  (cost=0.00..2.10 rows=1 width=84) (actual
time=0.000..0.000 rows=1 loops=1)

   ->  Seq Scan on sales s  (cost=0.00..1.05 rows=1 width=42) (actual
time=0.000..0.000 rows=1 loops=1)

         Filter:
(cust_id = 2)

         Rows Removed
by Filter: 3

   ->  Seq Scan on customers c  (cost=0.00..1.04 rows=1 width=42) (actual
time=0.000..0.000 rows=1 loops=1)

         Filter:
(cust_id = 2)

         Rows Removed
by Filter: 2

 Total runtime: 0.000
ms

(8 rows)

Third:

postgres=# explain analyze select * from sales s inner join
customers c on s.cust_id = c.cust_id and c.cust_id <4;

QUERY PLAN                                                 

------------------------------------------------------------------------------------------------------------

 Nested Loop  (cost=0.00..2.13 rows=1 width=84) (actual
time=0.014..0.018 rows=3 loops=1)

   Join Filter:
(s.cust_id = c.cust_id)

   Rows Removed by
Join Filter: 9

   ->  Seq Scan on customers c  (cost=0.00..1.04 rows=1 width=42) (actual
time=0.007..0.007 rows=3 loops=1)

         Filter:
(cust_id < 4)

   ->  Seq Scan on sales s  (cost=0.00..1.04 rows=4 width=42) (actual
time=0.000..0.000 rows=4 loops=3)

 Total runtime: 0.038
ms

(7 rows)

postgres=#

The first sql statement and third sql statment really drive
the final_cost_hashjoin function to be called.

I think  For the above
third one,

 cost_qual_eval(&hash_qual_cost, hashclauses,
root)  is  for 
<s.cust_id = c.cust_id> 

And

cost_qual_eval(&qp_qual_cost,
path->jpath.joinrestrictinfo, root) is for <s.cust_id =
c.cust_id and c.cust_id <4>

I've  found the following
calling relation:

hash_inner_and_outer  à try_hashjoin_path  à
create_hashjoin_path à
final_cost_hashjoin

For the second sql statement , 

In the hash_inner_and_outer function, 

 the < if ( hashclauses) > condition is false,  So there is no chance to try a hashjoin path.

That is : 

When I use  the  where condition such as  <cust_id=2>, 

postgresql is clever enough to know it is better to
make  seqscan and filter ? 

2013/6/13 Tom Lane <tgl@xxxxxxxxxxxxx>

Stephen Frost <sfrost@xxxxxxxxxxx> writes:

> * 高健 (luckyjackgao@xxxxxxxxx) wrote:

>> Why the reduction  is needed here  for cost calculation?

>       cost_qual_eval(&hash_qual_cost, hashclauses, root);

> returns the costs for *just the quals which can be used for the

> hashjoin*, while

>       cost_qual_eval(&qp_qual_cost, path->jpath.joinrestrictinfo, root);

> returns the costs for *ALL the quals*

Right.  Note what it says in create_hashjoin_path:

 * 'restrict_clauses' are the RestrictInfo nodes to apply at the join

 ...

 * 'hashclauses' are the RestrictInfo nodes to use as hash clauses

 *        (this should be a subset of the restrict_clauses list)

So the two cost_qual_eval() calls are *both* counting the cost of the

hashclauses, and we have to undo that to get at just the cost of any

additional clauses beside the hash clauses.  See the comment about the

usage of qp_qual_cost further down:

    /*

     * For each tuple that gets through the hashjoin proper, we charge

     * cpu_tuple_cost plus the cost of evaluating additional restriction

     * clauses that are to be applied at the join.  (This is pessimistic since

     * not all of the quals may get evaluated at each tuple.)

     */

    startup_cost += qp_qual_cost.startup;

    cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple;

    run_cost += cpu_per_tuple * hashjointuples;

            regards, tom lane