Thanks a lot Tom, as always :)
We generally do not have so many duplicates in production, so maybe this is an edge case but I am happy with the explanation and the code reference for the analysis.
I’ll also play with default statistic target to see what changes by increasing the value.
On Sun, 16 Dec 2018 at 5:52 AM Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
Vijaykumar Jain <vjain@xxxxxxxxxxxxx> writes:
> I was just playing with exploring joins and plans i came across this
> create table t1(a int);
> create table t2(a int);
> insert into t1 select (x % 10) from generate_series(1, 100000) x;
> insert into t2 select (x % 100) from generate_series(1, 100000) x;
> ...
> select * from t1 join t2 using (a);
Hm. This is a fairly extreme case for mergejoining. In the first place,
because of the disparity in the key ranges (t1.a goes from 0..9, t2.a
from 0..99) the planner can figure out that a merge join can stop after
scanning only 10% of t2. That doesn't help much here, since we still
have to sort all of t2, but nonetheless the planner is going to take
that into account. In the second place, because you have so many
duplicate values, most rows in t1 will require "rescanning" 1000 rows
that were already read and joined to the previous row of t1 (assuming
t1 is on the left of the join; it's worse if t2 is on the left).
The planner estimates each of those situations properly, but it looks
to me like it is not handling the combination of both effects correctly.
In costsize.c we've got
/*
* The number of tuple comparisons needed is approximately number of outer
* rows plus number of inner rows plus number of rescanned tuples (can we
* refine this?). At each one, we need to evaluate the mergejoin quals.
*/
startup_cost += merge_qual_cost.startup;
startup_cost += merge_qual_cost.per_tuple *
(outer_skip_rows + inner_skip_rows * rescanratio);
run_cost += merge_qual_cost.per_tuple *
((outer_rows - outer_skip_rows) +
(inner_rows - inner_skip_rows) * rescanratio);
where outer_rows and inner_rows are the numbers of rows we're predicting
to actually read from each input, the xxx_skip_rows values are zero for
this example, and rescanratio was previously computed as
/* We'll inflate various costs this much to account for rescanning */
rescanratio = 1.0 + (rescannedtuples / inner_path_rows);
where inner_path_rows is the *total* size of the inner relation,
including rows that we're predicting won't get read because of the
stop-short effect.
As far as I can tell, that comment's claim about the number of tuple
comparisons needed is on-target ... but the code is computing a number
of tuple comparisons 10x less than that. The reason is that rescanratio
is wrong: it should be
rescanratio = 1.0 + (rescannedtuples / inner_rows);
instead, so that it's something that makes sense to multiply inner_rows
by. In the existing uses of rescanratio, one multiplies it by
inner_path_rows and needs to be changed to inner_rows to agree with
this definition, but the other uses are already consistent with this.
This doesn't make a significant difference if either rescannedtuples
is small, or inner_rows isn't much less than inner_path_rows. But
when neither is true, we can greatly underestimate the number of tuple
comparisons we'll have to do, as well as the number of re-fetches from
the inner plan node. I think in practice it doesn't matter that often,
because in such situations we'd usually not have picked a mergejoin
anyway. But in your example the buggy mergejoin cost estimate is about
10% less than the hashjoin cost estimate, so we go with mergejoin.
The attached proposed patch fixes this, raising the mergejoin cost
estimate to about 35% more than the hashjoin estimate, which seems
a lot closer to reality. It doesn't seem to change any results in
the regression tests, which I find unsurprising: there are cases
like this in the tests, but as I just said, they pick hashjoins
already.
Also interesting is that after this fix, the estimated costs of a
mergejoin for this example are about the same whether t1 or t2 is on
the left. I think that's right: t2-on-the-left has 10x more rescanning
to do per outer tuple, but it stops after scanning only 10% of the
outer relation, canceling that out.
I'm not sure whether to back-patch this. It's a pretty clear thinko,
but there's the question of whether we'd risk destabilizing plan
choices that are working OK in the real world.
regards, tom lane
Vijay