Re: unnesesary sorting after Merge Full Join

Decibel! <decibel@xxxxxxxxxxx> · Sat, 23 Feb 2008 14:49:19 -0600

On Feb 21, 2008, at 4:08 AM, Alexey Nalbat wrote:
I'd like to use ORDER BY in any specified order and LIMIT, OFFSET  
for paging query results.
The query is FULL OUTER JOIN of two tables by field id. I think the  
results of Merge Full Join
to be ordered by some "combined id". And there is no need in extra  
Sort if I specify ORDER BY
that "combined id". But unfortunately it is not so.

Here is a simple example:
-- BEGIN
create table t1 as select generate_series(1,1000000,2) as id;
create table t2 as select generate_series(1,1000000,3) as id;

create index i1 on t1 ( id );
create index i2 on t2 ( id );

analyze t1;
analyze t2;

explain analyze
select id, t1.*, t2.*
from t1 natural full join t2
order by 1 limit 10 offset 10;

drop table t1;
drop table t2;
-- END

Postgresql chooses such plan:
 Limit  (cost=44080.12..44080.15 rows=10 width=8) (actual  
time=6724.850..6724.906 rows=10 loops=1)
   ->  Sort  (cost=44080.10..45330.10 rows=500000 width=8) (actual  
time=6724.806..6724.845 rows=20 loops=1)
         Sort Key: (COALESCE(t1.id, t2.id))
         Sort Method:  top-N heapsort  Memory: 25kB
         ->  Merge Full Join  (cost=0.00..30775.28 rows=500000  
width=8) (actual time=0.142..5237.289 rows=666667 loops=1)
               Merge Cond: (t1.id = t2.id)
               ->  Index Scan using i1 on t1  (cost=0.00..15212.30  
rows=500000 width=4) (actual time=0.079..1188.601 rows=500000 loops=1)
               ->  Index Scan using i2 on t2  (cost=0.00..10146.30  
rows=333334 width=4) (actual time=0.051..793.635 rows=333334 loops=1)

The desired plan is much faster:
 Limit  (cost=0.62..1.23 rows=10 width=8) (actual time=0.262..0.366  
rows=10 loops=1)
   ->  Merge Full Join  (cost=0.00..30775.28 rows=500000 width=8)  
(actual time=0.156..0.303 rows=20 loops=1)
         Merge Cond: (t1.id = t2.id)
         ->  Index Scan using i1 on t1  (cost=0.00..15212.30  
rows=500000 width=4) (actual time=0.088..0.120 rows=15 loops=1)
         ->  Index Scan using i2 on t2  (cost=0.00..10146.30  
rows=333334 width=4) (actual time=0.056..0.078 rows=11 loops=1)

I found comment in src/backend/optimizer/path/pathkeys.c:
* EXCEPTION: in a FULL or RIGHT join, we cannot treat the result as
* having the outer path's path keys, because null lefthand rows may be
* inserted at random points. It must be treated as unsorted.

How can I get rid of this sorting? Or could this behavior of Merge  
Full Join be improved?

Theoretically, this can be improved, but I suspect it would be non- 
trivial. I suspect that the problem is the planner doesn't realize  
that the join key could never be null, which is often not the case.  
Consider this example:

decibel=# create table t1 as select generate_series(1,1000000,2) as id1;
SELECT
decibel=# create table t2 as select generate_series(1,1000000,3) as id2;
SELECT

Create index, etc.

explain analyze
select id1, id2
from t1 full join t2 on (t1.id1=t2.id2)
order by 1 limit 10 offset 10;

Note that in this case you have to re-sort, because NULLs sort  
differently.

As a workaround, I suggest creating a table that contains all the IDs  
from t1 and t2. You could maintain this table via a trigger if you  
wanted. You could then quickly determine the exact IDs you wanted,  
and then join against the two real tables.
--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@xxxxxxxxxxx
Give your computer some brain candy! www.distributed.net Team #1828

<<attachment: smime.p7s>>