Hello everyone, I'm still learning the Postgres planner and performance optimization, so please kindly point out if I've missed something obvious.
I've noticed that Postgres joins all rows of two tables even though the query has a `limit`. This means many joined rows are eventually discarded, and performance could be better if the planner could apply the `limit` before the `join`.
Here's my example. I am running Postgres 17.3 in a container:
```sh
docker run --rm -e POSTGRES_PASSWORD=mysecretpassword -d -p 5432:5432 postgres:17.3
```
```sql
create table department(
    id int primary key,
    name text);
create table employee(
    id int primary key,
    name text,
    department_id int);
insert into department values
(1, 'sales'),
(2, 'support'),
(3, 'research'),
(4, 'management'),
(5, 'hr');
insert into employee values
(101, 'alice', 1),
(102, 'bob', 2),
(103, 'ccc', 3),
(104, 'ddd', 4),
(105, 'eee', 999);
analyze department;
analyze employee;
set enable_mergejoin = off;
set enable_hashjoin = off;
explain analyze
select *
from employee left outer join department
on employee.department_id = department.id
order by employee.name limit 2;
```
In this simple setup, I want to see the top 2 employees by name and check their department info. Here's the query plan:
```
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2.49..2.49 rows=2 width=23) (actual time=0.037..0.038 rows=2 loops=1)
   ->  Sort  (cost=2.49..2.50 rows=5 width=23) (actual time=0.036..0.037 rows=2 loops=1)
         Sort Key: employee.name
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Nested Loop Left Join  (cost=0.00..2.44 rows=5 width=23) (actual time=0.014..0.021 rows=5 loops=1)
               Join Filter: (employee.department_id = department.id)
               Rows Removed by Join Filter: 11
               ->  Seq Scan on employee  (cost=0.00..1.05 rows=5 width=12) (actual time=0.006..0.007 rows=5 loops=1)
               ->  Materialize  (cost=0.00..1.07 rows=5 width=11) (actual time=0.001..0.001 rows=3 loops=5)
                     ->  Seq Scan on department  (cost=0.00..1.05 rows=5 width=11) (actual time=0.001..0.002 rows=5 loops=1)
 Planning Time: 0.248 ms
 Execution Time: 0.062 ms
(12 rows)
```
Note `loops=5` on the Materialize node. My understanding is that all 5 rows of table employee are joined, although we only need 2. Joining before the limit is necessary when we do not know whether the join will create or eliminate rows. In this setup, however, it's a left outer join on a primary key, so we know each employee row produces exactly one output row. A similar assumption can be made for `a inner join b on a.x = b.x` when a.x is a foreign key referencing b.x and b.x has a unique constraint.
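To make that concrete, here's a sketch of the inner-join case on the same schema. The constraint name is my own invention, and `not valid` is only there so the statement succeeds despite employee 105 pointing at the nonexistent department 999; the one-row-per-employee guarantee of course only holds once the constraint is actually validated.
```sql
-- Hypothetical constraint, added only to illustrate the inner-join case.
-- "not valid" skips checking existing rows, since employee 105 references
-- department 999, which does not exist in the sample data.
alter table employee
    add constraint employee_department_fk
    foreign key (department_id) references department(id) not valid;

-- Once validated, the FK plus department's primary key would guarantee
-- exactly one department row per employee, so pushing the limit below
-- this inner join would be safe as well.
explain analyze
select *
from employee inner join department
    on employee.department_id = department.id
order by employee.name limit 2;
```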
At the moment, my workaround is to rewrite the query with a CTE. I have to write `order by ... limit ...` twice, but `loops=2` is observed:
```
postgres=# explain analyze
with t as (
    select * from employee order by employee.name limit 2
) select *
from t left outer join department on t.department_id = department.id
order by t.name limit 2;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1.10..2.32 rows=2 width=23) (actual time=0.045..0.049 rows=2 loops=1)
   ->  Nested Loop Left Join  (cost=1.10..2.32 rows=2 width=23) (actual time=0.044..0.047 rows=2 loops=1)
         Join Filter: (employee.department_id = department.id)
         Rows Removed by Join Filter: 1
         ->  Limit  (cost=1.10..1.10 rows=2 width=12) (actual time=0.032..0.033 rows=2 loops=1)
               ->  Sort  (cost=1.10..1.11 rows=5 width=12) (actual time=0.032..0.033 rows=2 loops=1)
                     Sort Key: employee.name
                     Sort Method: top-N heapsort  Memory: 25kB
                     ->  Seq Scan on employee  (cost=0.00..1.05 rows=5 width=12) (actual time=0.015..0.016 rows=5 loops=1)
         ->  Materialize  (cost=0.00..1.07 rows=5 width=11) (actual time=0.004..0.004 rows=2 loops=2)
               ->  Seq Scan on department  (cost=0.00..1.05 rows=5 width=11) (actual time=0.004..0.005 rows=2 loops=1)
 Planning Time: 0.196 ms
 Execution Time: 0.103 ms
(13 rows)
```
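A plain subquery in FROM seems to give the same plan shape without a CTE. The outer `order by` still looks necessary, since join output order is not guaranteed, but assuming the left join on department's primary key can neither multiply nor eliminate rows, the second `limit` can at least be dropped:
```sql
explain analyze
select *
from (select * from employee order by name limit 2) as t
left outer join department on t.department_id = department.id
order by t.name;
```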
I wonder whether there's a better way to influence the planner to push the `limit` down to the `employee` table first, or whether this optimization strategy is simply not in the planner yet. The demo is trivial, but in real cases one might select the top 100 from a million rows. Sorting a million rows is slow but necessary and acceptable; a wasteful nested loop join over millions of rows, however, can make performance much worse. Even worse, the wasteful join can happen multiple times when more than two tables are joined, as in the sketch below.
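Here's a hypothetical extension of the demo with a third table (`location` is my invention, joined in a contrived way just to add a second join level). Without limit pushdown, all 5 employee rows should flow through both joins before the top-N sort discards 3 of them:
```sql
-- Hypothetical third table, only to show the wasted work compounding
-- across two join levels.
create table location(
    id int primary key,
    city text);
insert into location values (1, 'berlin'), (2, 'tokyo');
analyze location;

explain analyze
select *
from employee
left outer join department on employee.department_id = department.id
left outer join location on department.id = location.id
order by employee.name limit 2;
```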
Thank you
Best regards
Yan