Search Postgresql Archives

Inefficient query plan for SELECT ... EXCEPT ...

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello list,

I'm getting an inefficient query plan for a SELECT ... EXCEPT ... query,
where the left side is a very short table (even zero-length sometimes, but
also also rarely can be as long as 200K rows), and the right side is a
table with 10M UNIQUE NOT NULL rows:

\d test_datatags
                            Table "public.test_datatags"
     Column     |  Type   | Collation | Nullable |             Default
----------------+---------+-----------+----------+----------------------------------
 test_datatag_n | integer |           | not null | generated by default as identity
 test_datatag   | text    |           | not null |
Indexes:
    "test_datatags_pkey" PRIMARY KEY, btree (test_datatag_n)
    "test_datatags_test_datatag_key" UNIQUE CONSTRAINT, btree (test_datatag)


Follows a simplified and pathological case, that takes 1.7s even though
the left side is empty:


BEGIN;
CREATE TEMPORARY TABLE tmp_table(d1 TEXT) ON COMMIT DROP;
ANALYZE VERBOSE tmp_table ;
INFO:  analyzing "pg_temp_9.tmp_table"
INFO:  "tmp_table": scanned 0 of 0 pages, containing 0 live rows and 0 dead rows; 0 rows in sample, 0 estimated total rows
EXPLAIN (ANALYSE,VERBOSE,BUFFERS,SETTINGS)
            SELECT DISTINCT  d1
                FROM tmp_table
                WHERE d1 IS NOT NULL
            EXCEPT
                SELECT test_datatag FROM test_datatags;

                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 HashSetOp Except  (cost=0.00..299854.08 rows=1 width=36) (actual time=1726.470..1726.472 rows=0 loops=1)
   Output: "*SELECT* 1".d1, (0)
   Buffers: shared hit=1259 read=58800
   I/O Timings: shared/local read=77.713
   ->  Append  (cost=0.00..278054.53 rows=8719821 width=36) (actual time=3.754..1287.901 rows=8702840 loops=1)
         Buffers: shared hit=1259 read=58800
         I/O Timings: shared/local read=77.713
         ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..0.02 rows=1 width=36) (actual time=0.003..0.003 rows=0 loops=1)
               Output: "*SELECT* 1".d1, 0
               ->  HashAggregate  (cost=0.00..0.01 rows=1 width=32) (actual time=0.002..0.003 rows=0 loops=1)
                     Output: tmp_table.d1
                     Group Key: tmp_table.d1
                     Batches: 1  Memory Usage: 24kB
                     ->  Seq Scan on pg_temp.tmp_table  (cost=0.00..0.00 rows=1 width=32) (actual time=0.001..0.001 rows=0 loops=1)
                           Output: tmp_table.d1
                           Filter: (tmp_table.d1 IS NOT NULL)
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..234455.40 rows=8719820 width=26) (actual time=3.751..943.850 rows=8702840 loops=1)
               Output: "*SELECT* 2".test_datatag, 1
               Buffers: shared hit=1259 read=58800
               I/O Timings: shared/local read=77.713
               ->  Seq Scan on public.test_datatags  (cost=0.00..147257.20 rows=8719820 width=22) (actual time=3.747..515.420 rows=8702840 loops=1)
                     Output: test_datatags.test_datatag
                     Buffers: shared hit=1259 read=58800
                     I/O Timings: shared/local read=77.713
 Settings: effective_io_concurrency = '0', enable_partitionwise_aggregate = 'on', enable_partitionwise_join = 'on', hash_mem_multiplier = '1',
 maintenance_io_concurrency = '0', max_parallel_workers = '12', max_parallel_workers_per_gather = '0', temp_buffers = '64MB'
 Planning:
   Buffers: shared hit=2
 Planning Time: 0.055 ms
 JIT:
   Functions: 15
   Options: Inlining false, Optimization false, Expressions true,
Deforming true
   Timing: Generation 0.317 ms, Inlining 0.000 ms, Optimization 0.179 ms,
Emission 3.542 ms, Total 4.037 ms
 Execution Time: 1726.835 ms
(33 rows)


I'm wondering why the planner doesn't see that the left table is very small and follow a different path.
From an abstract computer science POV, I would

1. sort the left table  (the right one is already indexed)
2. "merge" the two tables, by walking them in-order in parallel and excluding the matches
3. stop when the left table is exhausted, which would happen very early.

Is this worth a bug report? I can file one if the issue is not known.
Or am I misunderstanding the implications of the SELECT-EXCEPT query?

In the meantime I have replaced the query with a LEFT OUTER JOIN which
performs much better, and I believe is equivalent. I find it less readable
than the query in question though. Plus, I have a bunch of SELECT-EXCEPT
queries (with smaller right-side tables) in my application that I would
hate to change them all to the ugliest equivalent. Under what conditions
would the above query plan perform well?

Thanks in advance,
Dimitris







[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]

  Powered by Linux