Vijaykumar Jain <vjain@xxxxxxxxxxxxx> writes: > I was just playing with exploring joins and plans i came across this > create table t1(a int); > create table t2(a int); > insert into t1 select (x % 10) from generate_series(1, 100000) x; > insert into t2 select (x % 100) from generate_series(1, 100000) x; > ... > select * from t1 join t2 using (a); Hm. This is a fairly extreme case for mergejoining. In the first place, because of the disparity in the key ranges (t1.a goes from 0..9, t2.a from 0..99) the planner can figure out that a merge join can stop after scanning only 10% of t2. That doesn't help much here, since we still have to sort all of t2, but nonetheless the planner is going to take that into account. In the second place, because you have so many duplicate values, most rows in t1 will require "rescanning" 1000 rows that were already read and joined to the previous row of t1 (assuming t1 is on the left of the join; it's worse if t2 is on the left). The planner estimates each of those situations properly, but it looks to me like it is not handling the combination of both effects correctly. In costsize.c we've got /* * The number of tuple comparisons needed is approximately number of outer * rows plus number of inner rows plus number of rescanned tuples (can we * refine this?). At each one, we need to evaluate the mergejoin quals. */ startup_cost += merge_qual_cost.startup; startup_cost += merge_qual_cost.per_tuple * (outer_skip_rows + inner_skip_rows * rescanratio); run_cost += merge_qual_cost.per_tuple * ((outer_rows - outer_skip_rows) + (inner_rows - inner_skip_rows) * rescanratio); where outer_rows and inner_rows are the numbers of rows we're predicting to actually read from each input, the xxx_skip_rows values are zero for this example, and rescanratio was previously computed as /* We'll inflate various costs this much to account for rescanning */ rescanratio = 1.0 + (rescannedtuples / inner_path_rows); where inner_path_rows is the *total* size of the inner relation, including rows that we're predicting won't get read because of the stop-short effect. As far as I can tell, that comment's claim about the number of tuple comparisons needed is on-target ... but the code is computing a number of tuple comparisons 10x less than that. The reason is that rescanratio is wrong: it should be rescanratio = 1.0 + (rescannedtuples / inner_rows); instead, so that it's something that makes sense to multiply inner_rows by. In the existing uses of rescanratio, one multiplies it by inner_path_rows and needs to be changed to inner_rows to agree with this definition, but the other uses are already consistent with this. This doesn't make a significant difference if either rescannedtuples is small, or inner_rows isn't much less than inner_path_rows. But when neither is true, we can greatly underestimate the number of tuple comparisons we'll have to do, as well as the number of re-fetches from the inner plan node. I think in practice it doesn't matter that often, because in such situations we'd usually not have picked a mergejoin anyway. But in your example the buggy mergejoin cost estimate is about 10% less than the hashjoin cost estimate, so we go with mergejoin. The attached proposed patch fixes this, raising the mergejoin cost estimate to about 35% more than the hashjoin estimate, which seems a lot closer to reality. It doesn't seem to change any results in the regression tests, which I find unsurprising: there are cases like this in the tests, but as I just said, they pick hashjoins already. Also interesting is that after this fix, the estimated costs of a mergejoin for this example are about the same whether t1 or t2 is on the left. I think that's right: t2-on-the-left has 10x more rescanning to do per outer tuple, but it stops after scanning only 10% of the outer relation, canceling that out. I'm not sure whether to back-patch this. It's a pretty clear thinko, but there's the question of whether we'd risk destabilizing plan choices that are working OK in the real world. regards, tom lane
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c index 7bf67a0..480fd25 100644 *** a/src/backend/optimizer/path/costsize.c --- b/src/backend/optimizer/path/costsize.c *************** final_cost_mergejoin(PlannerInfo *root, *** 2941,2948 **** if (rescannedtuples < 0) rescannedtuples = 0; } ! /* We'll inflate various costs this much to account for rescanning */ ! rescanratio = 1.0 + (rescannedtuples / inner_path_rows); /* * Decide whether we want to materialize the inner input to shield it from --- 2941,2953 ---- if (rescannedtuples < 0) rescannedtuples = 0; } ! ! /* ! * We'll inflate various costs this much to account for rescanning. Note ! * that this is to be multiplied by something involving inner_rows, or ! * another number related to the portion of the inner rel we'll scan. ! */ ! rescanratio = 1.0 + (rescannedtuples / inner_rows); /* * Decide whether we want to materialize the inner input to shield it from *************** final_cost_mergejoin(PlannerInfo *root, *** 2969,2975 **** * of the generated Material node. */ mat_inner_cost = inner_run_cost + ! cpu_operator_cost * inner_path_rows * rescanratio; /* * If we don't need mark/restore at all, we don't need materialization. --- 2974,2980 ---- * of the generated Material node. */ mat_inner_cost = inner_run_cost + ! cpu_operator_cost * inner_rows * rescanratio; /* * If we don't need mark/restore at all, we don't need materialization.