James Parks <james.parks@xxxxxxxxxx> writes:
> On Mon, Feb 29, 2016 at 5:22 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
>> The other explain shows a scan of "a" reading about 490k rows and
>> returning 395 of them, so there's a factor of about 200 re-read here.
>> I wonder if the planner should have inserted a materialize node to
>> reduce that.
>>
>> However, I think the real problem is upstream of that: if that indexscan
>> was estimated at 26163.20 units, how'd the mergejoin above it get costed
>> at only 7850.13 units?  The answer has to be that the planner thought the
>> merge would stop before reading most of "a", as a result of limited range
>> of b.a_id.  It would be interesting to look into what the actual maximum
>> b.a_id value is.

> I've attached a pg_dump of a database that contains all of the data, in
> the event you (or others) would like to look at it. The attachment is
> ~1.8MB (gzipped), and you can replay the pg_dump file on a database that
> has just been created with initdb.

Thanks for sending the test data --- I got a chance to look at this
finally.  It looks like my first suspicion was right and the second one
wrong: the planner is badly underestimating the amount of rescan required,
and thus not putting in a Materialize buffer node where needed.
If I force a Materialize node to be put in, without changing any cost
estimates (basically, set path->materialize_inner = 1 at the end of
final_cost_mergejoin), then I get this:

 Sort  (cost=7343.28..7343.62 rows=137 width=16) (actual time=218.342..232.817 rows=83427 loops=1)
   Sort Key: b.id
   Sort Method: external merge  Disk: 2120kB
   ->  Merge Join  (cost=696.93..7338.41 rows=137 width=16) (actual time=18.627..136.549 rows=83427 loops=1)
         Merge Cond: (b.a_id = a.id)
         ->  Index Scan using a_idx on b  (cost=0.29..2708.59 rows=83427 width=16) (actual time=0.048..28.876 rows=83427 loops=1)
         ->  Materialize  (cost=0.42..24368.01 rows=805 width=8) (actual time=18.568..74.447 rows=83658 loops=1)
               ->  Index Scan using a_pkey on a  (cost=0.42..24366.00 rows=805 width=8) (actual time=18.560..66.186 rows=238 loops=1)
                     Filter: (nonce = 64)
                     Rows Removed by Filter: 87955
 Execution time: 239.195 ms

which is a pretty satisfactory result in terms of runtime, though still a
bit slower than the hash alternative.  Now, there are 490166 "a" rows, but
we can see that the inner indexscan stopped after reading 87955 + 238 =
88193 of them.  So there really is an early-stop effect and the planner
seems to have gotten that about right.  The problem is the rescan effect,
which we can now quantify at 83658/238 or about 350:1.  Despite which,
stepping through final_cost_mergejoin finds that it estimates *zero*
rescan effect.
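To make the arithmetic explicit (a sketch, not planner code --- the
numbers are just the "actual rows" figures read off the plan above): the
Materialize node emitted 83658 inner tuples, while the indexscan under it
produced only 238 distinct ones, so everything beyond the first delivery
of each tuple came from rescans.

```python
# Quantifying the rescan effect from the "actual rows" figures above.
materialize_rows = 83658   # rows emitted by the Materialize node
indexscan_rows = 238       # distinct rows from the underlying indexscan

# Tuples delivered again on rescan, and the rescan ratio.
rescanned = materialize_rows - indexscan_rows
ratio = materialize_rows / indexscan_rows

print(rescanned)   # 83420
print(ratio)       # roughly 350, i.e. the ~350:1 rescan effect
```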
If I force the rescannedtuples estimate to be the correct value of
83658 - 238 = 83420, it properly decides that a materialize node ought to
be put in; but that also increases its overall cost estimate for this
mergejoin to 7600, which causes it to prefer doing the join in the other
direction:

 Sort  (cost=7343.28..7343.62 rows=137 width=16) (actual time=169.579..184.184 rows=83427 loops=1)
   Sort Key: b.id
   Sort Method: external merge  Disk: 2120kB
   ->  Merge Join  (cost=696.93..7338.41 rows=137 width=16) (actual time=16.460..108.562 rows=83427 loops=1)
         Merge Cond: (a.id = b.a_id)
         ->  Index Scan using a_pkey on a  (cost=0.42..24366.00 rows=805 width=8) (actual time=16.442..63.197 rows=238 loops=1)
               Filter: (nonce = 64)
               Rows Removed by Filter: 87955
         ->  Index Scan using a_idx on b  (cost=0.29..2708.59 rows=83427 width=16) (actual time=0.013..26.811 rows=83427 loops=1)
 Execution time: 190.441 ms

which is an even more satisfactory outcome; the cheaper merge direction is
properly estimated as cheaper, and this is actually a tad faster than the
hash join for me.

So the entire problem here is a bogus rescannedtuples estimate.  The code
comment about that estimate is

     * When there are equal merge keys in the outer relation, the mergejoin
     * must rescan any matching tuples in the inner relation. This means
     * re-fetching inner tuples; we have to estimate how often that happens.
     *
     * For regular inner and outer joins, the number of re-fetches can be
     * estimated approximately as size of merge join output minus size of
     * inner relation. Assume that the distinct key values are 1, 2, ...,
     * and denote the number of values of each key in the outer relation as
     * m1, m2, ...; in the inner relation, n1, n2, ...  Then we have
     *
     *    size of join = m1 * n1 + m2 * n2 + ...
     *
     *    number of rescanned tuples = (m1 - 1) * n1 + (m2 - 1) * n2 + ...
     *       = m1 * n1 + m2 * n2 + ... - (n1 + n2 + ...)
     *       = size of join - size of inner relation
     *
     * This equation works correctly for outer tuples having no inner match
     * (nk = 0), but not for inner tuples having no outer match (mk = 0);
     * we are effectively subtracting those from the number of rescanned
     * tuples, when we should not. Can we do better without expensive
     * selectivity computations?

Some investigation of the actual key values says that indeed there are a
whole lot of a.id values that have no match in b.a_id, so I think the
comment at the end is telling us what the problem is.  Is there a better
way to make this estimate?

			regards, tom lane


-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
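[A worked footnote on the comment quoted above --- a small sketch in
Python rather than the planner's C, with function and variable names of my
own invention.  It contrasts the exact rescanned-tuples count,
sum((mk - 1) * nk), with the comment's "size of join - size of inner
relation" approximation, on data where most inner keys have no outer
match: exactly the mk = 0 case the comment warns about, and the shape of
the a.id / b.a_id data here.  The approximation goes negative, and once
clamped at zero it reports no rescans at all despite real ones.]

```python
from collections import Counter

def rescanned_exact(outer_keys, inner_keys):
    # number of rescanned tuples = (m1 - 1)*n1 + (m2 - 1)*n2 + ...
    # (only keys present in the outer relation can cause rescans)
    m, n = Counter(outer_keys), Counter(inner_keys)
    return sum((mk - 1) * n.get(k, 0) for k, mk in m.items())

def rescanned_approx(outer_keys, inner_keys):
    # the comment's approximation: size of join - size of inner relation,
    # clamped at zero since a negative rescan count is meaningless
    m, n = Counter(outer_keys), Counter(inner_keys)
    join_size = sum(mk * n.get(k, 0) for k, mk in m.items())
    return max(0, join_size - len(inner_keys))

# One outer key duplicated 10 times, matching a single inner tuple,
# plus 99 inner tuples with no outer match at all (the mk = 0 case).
outer = [1] * 10
inner = [1] + list(range(2, 101))

print(rescanned_exact(outer, inner))    # 9: the one matching inner
                                        # tuple is re-fetched 9 times
print(rescanned_approx(outer, inner))   # 0: join size 10 minus 100
                                        # inner tuples goes negative
```

The unmatched inner tuples inflate "size of inner relation" without
contributing to "size of join", which is how the estimate for the query
above comes out as zero.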