Optimising outer joins in the presence of non-nullable references

Philip Lykke Carlsen <philip@xxxxxxxxx> · Mon, 24 May 2021 14:10:24 +0200

Hi list.

I have a question about the different plans produced by postgres for
an outer join versus an inner join.

(complete sql script attached)

Take these two tables:

    CREATE TABLE album
      ( id SERIAL PRIMARY KEY,
        title TEXT NOT NULL
      );
    CREATE INDEX album_title ON album (title);

    CREATE TABLE track
      ( id SERIAL PRIMARY KEY,
        album_id INT NOT NULL REFERENCES album(id),
        title TEXT NOT NULL
      );
    CREATE INDEX track_album ON track(album_id);
    CREATE INDEX track_title ON track(title);

where crucially `track` references `album(id)` with a `NOT NULL` reference.

Now, if we query both an inner and an outer join on `track` and
`album` (after having suitably filled-in data), we get very different
plans where only the inner join exploits indices.

That is:

    EXPLAIN ANALYZE
    SELECT t.id
    FROM track t
    INNER JOIN album a ON (t.album_id = a.id)
    ORDER BY a.title ASC
    LIMIT 10;

Produces this query plan:

                                                              QUERY
PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.56..3.29 rows=10 width=36) (actual time=0.038..0.052
rows=10 loops=1)
   ->  Nested Loop  (cost=0.56..3606.93 rows=13200 width=36) (actual
time=0.036..0.046 rows=10 loops=1)
         ->  Index Scan using album_title on album a
(cost=0.28..113.23 rows=1397 width=36) (actual time=0.015..0.016
rows=1 loops=1)
         ->  Index Scan using track_album on track t  (cost=0.29..1.84
rows=66 width=8) (actual time=0.012..0.018 rows=10 loops=1)
               Index Cond: (album_id = a.id)
 Planning Time: 0.473 ms
 Execution Time: 0.096 ms
(7 rows)

While this:

    EXPLAIN ANALYZE
    SELECT t.id
    FROM track t
    LEFT JOIN album a ON (t.album_id = a.id)
    ORDER BY a.title ASC
    LIMIT 10;

Produces this query plan:

                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=604.43..604.45 rows=10 width=36) (actual
time=20.934..20.943 rows=10 loops=1)
   ->  Sort  (cost=604.43..637.43 rows=13200 width=36) (actual
time=20.932..20.934 rows=10 loops=1)
         Sort Key: a.title
         Sort Method: top-N heapsort  Memory: 27kB
         ->  Hash Left Join  (cost=42.43..319.18 rows=13200 width=36)
(actual time=1.082..12.333 rows=10000 loops=1)
               Hash Cond: (t.album_id = a.id)
               ->  Seq Scan on track t  (cost=0.00..242.00 rows=13200
width=8) (actual time=0.031..3.919 rows=10000 loops=1)
               ->  Hash  (cost=24.97..24.97 rows=1397 width=36)
(actual time=0.990..0.991 rows=1000 loops=1)
                     Buckets: 2048  Batches: 1  Memory Usage: 101kB
                     ->  Seq Scan on album a  (cost=0.00..24.97
rows=1397 width=36) (actual time=0.014..0.451 rows=1000 loops=1)
 Planning Time: 0.251 ms
 Execution Time: 20.999 ms
(12 rows)

My question then is, shouldn't the inner and outer join queries be
semantically equivalent when the columns we are joining on are
non-nullable foreign keys?
Is there some corner case I'm not considering?
Would it be a good addition to postgres if it could detect this and
produce a plan that exploits the indices?

(My root motivation for asking this question is this github issue:
https://github.com/hasura/graphql-engine/issues/5949)

Regards,
Philip
Attachment:
outer-join-indexes.sql

Description: application/sql